The RNA-seq data in this project originates from sequencing done at the Genomics Centre of QMUL. The samples came from Prof. Filipe Cabreiro's lab (then UCL, now Imperial) from normal and neurexin-deficient worms - the work was mostly carried out by Peter Cooke (then Phd student). Other members in Filipe's lab had helped out at the time with supervision and RNA extraction. The sequencing centre did not achieve good ribosomal depletion as it transpired that they did not have the right kit for C.elegans. Hence, only medium/high expressed genes are seen in these data and ribosomal RNA reads must be removed before analysing the data further.
Aine did a differential gene expression analysis based on counting reads using HTseq count and doing differential expresion with DESEq2 and limma. However, if I remember well, she missed out ~ half the reads due to a problem in one of her shell scripts (when pooling together the lanes, I think she pulled only two out of the four available).
It is now recognised that gene expression analysis should be based on aggregating the results of transcript-based analysis. Below is a list of links that describe how this should be done:
The data comprise 64 FASTQ files in total. There are 8 samples, each split into 4 lanes and each being sequenced in paired-end mode, hence there are: 8 x 4 x 2= 64 files.All data is from reversely-stranded libraries i.e. read 2 is in the same direction as the transcript on the genome. - 4 normal samples, names starting with "N" (biological replicates) - 4 neurexin-deficient samples, names starting with "SG" (biological replicates).
All samples originated from 1-day old adult worms.
Below is an outline of the steps to analyse the RNA-seq data from Aine's project, starting with raw fastq data and ending with mapped reads to the C. elegans transcriptome using kallisto.
All code was run on the departmental server thoth. Where the variable $PROJECT is used below it refers to the following directory: /d/in16/u/ubcg71a/research/filipe/ .
Initial fastqc data quality analysis on raw data was carried out by both Aine and repeated by Krzysztof at some point. Results can be found here: /d/in13/u/kszkop01/worm_neurexin/fastqc/
Had I run it myself I would have used my bash script run_fastqc.sh for running fastQC on multiple files:
#!/bin/bash
# Runs FASTQC on all fastq.gz files found in given directory
# Run as:
# ./run_fastqc dir_of_fastq_files
timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1 #all output will be logged to logfile
FASTQC_DIR="/s/software/fastqc/v0.11.8/FastQC/fastqc"
echo "Running FASTQC using executable: $FASTQC_DIR"
for fastq_file in $1/*.fastq.gz;
do
echo "Fastq file: $fastq_file"
$FASTQC_DIR $fastq_file -o .
done
I summarised K's FASTQC results using multiQC:
module use -s /s/mm/modules
module load python/v2
multiqc /d/in13/u/kszkop01/worm_neurexin/fastqc/
Results are under: $PROJECT/multiqc_on_raw/
All samples have sequences of a single length (76 bp).
Diagnostic plots from FASTQC summarised with multiQC (in all these plots, files with R1 are coloured red, files with R2 are blue):
Alt text
Alt text
I saved the statistics table from multiqc and read it in here to summarise:
t.stats <- read.table(file="multiqc/stats.txt", header=TRUE, sep="\t")
#use tidyr's separate to split the first column
t.stats <- separate(t.stats, Sample.Name, into=c("sample", "sample_num", "lane", "read", "nothing"), sep="_", remove=FALSE, convert=FALSE)
summary(t.stats$M.Seqs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.100 4.400 4.750 4.791 5.125 5.900
#print statistics per group
t.stats %>% group_by(sample) %>% summarize(mean=mean(M.Seqs))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
## sample mean
## <chr> <dbl>
## 1 N1 4.42
## 2 N2 5
## 3 N3 4.45
## 4 N4 4.42
## 5 SG1 5.25
## 6 SG2 4.22
## 7 SG3 4.9
## 8 SG4 5.65
t.stats %>% group_by(lane) %>% summarize(mean=mean(M.Seqs))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## lane mean
## <chr> <dbl>
## 1 L001 4.76
## 2 L002 4.66
## 3 L003 5
## 4 L004 4.74
t.stats %>% group_by(sample) %>% summarize(mean=mean(X..Dups))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
## sample mean
## <chr> <dbl>
## 1 N1 82.4
## 2 N2 82.6
## 3 N3 83.4
## 4 N4 79.9
## 5 SG1 81.6
## 6 SG2 80.4
## 7 SG3 81.4
## 8 SG4 81.4
t.stats %>% group_by(lane) %>% summarize(mean=mean(X..Dups))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## lane mean
## <chr> <dbl>
## 1 L001 81.5
## 2 L002 81.4
## 3 L003 82.0
## 4 L004 81.6
t.stats %>% group_by(sample) %>% summarize(mean=mean(X..GC))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
## sample mean
## <chr> <dbl>
## 1 N1 49
## 2 N2 48
## 3 N3 49
## 4 N4 48
## 5 SG1 48.5
## 6 SG2 48
## 7 SG3 49
## 8 SG4 49
t.stats %>% group_by(lane) %>% summarize(mean=mean(X..GC))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## lane mean
## <chr> <dbl>
## 1 L001 48.6
## 2 L002 48.6
## 3 L003 48.6
## 4 L004 48.6
I trimmed (adapters and poly(A) tails) and quality-filtered the reads using the program fastp :
Raw fastq data is under: $PROJECT/data/Data/170519_NS500784_0235_AHT2JYBGX2/Demulti/Data/Intensities/BaseCalls/GC-IN-6725/
To remove poly(A) tails and adapters, I ran the script under $PROJECT/fastp:
nohup ./run_fastp.sh ../data/Data/170519_NS500784_0235_AHT2JYBGX2/Demulti/Data/Intensities/BaseCalls/GC-IN-6725/ >& run_fastp.out &
Below is the run_fastp.sh script:
#!/bin/bash
# Runs fastp in PE mode for all .fastq.gz files in a directory
# Run as:
# ./run_fastp.sh directory_of_samples
timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1 #all output will be logged to logfile
FASTP_EXEC="/d/in7/s/fastp/fastp"
DIR=$1
ADAPTER_FILE="./adapters_all.fa"
echo "Running fastp using executable: $FASTP_EXEC"
for file in `ls $DIR/*_R1_*.fastq.gz`;
do
sample=${file/$DIR\//}
sample=${sample/_R1_001.fastq.gz/}
echo "Sample= $sample"
$FASTP_EXEC -i "$DIR/$sample"_R1_001.fastq.gz \
-I "$DIR/$sample"_R2_001.fastq.gz \
-o "$sample"_R1_001_trimmed.fastq.gz \
-O "$sample"_R2_001_trimmed.fastq.gz \
--adapter_fasta $ADAPTER_FILE \
-l 30 --trim_poly_x \
-h "$sample"_fastp.html -j "$sample"_fastp.json
done
NOTE: fastp by default uses the following quality filtering options:
Re-ran fastQC with the files in the fastp directory: ./run_fastp.sh ../fastp Results in the directory: $PROJECT/fastqc_after_fastp
Then re-ran multiqc and moved the .html file locally to save the plots shown below:
Alt text
Alt text
Alt text
Alt text
Alt text
Alt text
Results are under $PROJECT/fastqc_after_fastp/multiqc
Seeing that there were no strong lane effects, I merged all 4 lanes for each sample (this could have been done following mapping but there really was no evidence here to suggest a lane effect). The new files are in the $PROJECT/fastp_merged directory and the script run_fastq_merger.sh is given below:
#!/bin/bash
# Merges all fastq files for the same sample and same read orientation into single fastqc
# Run as:
# ./run_fastq_merger.sh directory_of_fastq_files sample_name sample_number
timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1 #all output will be logged to logfile
MERGED_DIR="/d/in16/u/ubcg71a/research/filipe/fastp_merged/"
cd $1
sample=$2
samplenum=$3
echo $(pwd)
echo Merging for sample $sample
cat "$sample"*"$samplenum"*R1_001*fastq.gz > $MERGED_DIR/"$sample"_"$samplenum"_R1.fastq.gz
cat "$sample"*"$samplenum"*R2_001*fastq.gz > $MERGED_DIR/"$sample"_"$samplenum"_R2.fastq.gz
./run_fastq_merger.sh ../fastp/ N1 S5
./run_fastq_merger.sh ../fastp/ N2 S6
./run_fastq_merger.sh ../fastp/ N3 S7
./run_fastq_merger.sh ../fastp/ N4 S8
./run_fastq_merger.sh ../fastp/ SG1 S1
./run_fastq_merger.sh ../fastp/ SG2 S2
./run_fastq_merger.sh ../fastp/ SG3 S3
./run_fastq_merger.sh ../fastp/ SG4 S4
Before carrying out the analysis with kallisto/sleuth, we will map the reads using STAR and carry out some exploratory analysis of the data.
Aine's work showed a huge number of reads mapping to rRNA.Here, we will allow them to map and exclude them later. During the kallisto run, we will remove them from the transcriptome before mapping to save hassle.
First, we need to create an index of the genome with STAR:
module load star/v2.7
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir . --genomeFastaFiles ../Caenorhabditis_elegans.WBcel235.dna.toplevel.fa --sjdbGTFfile ../Caenorhabditis_elegans.WBcel235.99.gtf --genomeSAindexNbases 12
#(note that parameter genomeSAindexNbases must be scaled down to 12 from the default 14 because the genome
#is smaller than the human genome)
#The genome index is currently under: /d/in16/u/ubcg71a/research/genomes/c.elegans/ensembl_99/STAR_INDEX/
#Once the genome index is created we can map the files using a bash script:
nohup ./run_star.sh ../fastp/ list_files > & run_star.out &
#where list_files contains a list of all samples
#N1_S5
#N2_S6
#N4_S8
#N3_S7
#SG1_S1
#SG2_S2
#SG3_S3
#SG4_S4
The script run_star.sh is given below:
#!/bin/bash
# Runs STAR mapping on all samples in a given directory
# Samples must end in .fastq.gz and are given as a list in a file (second argument)
# Run as:
# run_star.sh directory_of_fastq_files list_of_samples
timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1 #all output will be logged to logfile
dir=$1
shift
#set location of executables
STAR_EXEC="/s/software/STAR/bin/Linux_x86_64/STAR"
numProc=5 #number of processors/threads to be used
genomeIndex="/d/in16/u/ubcg71a/research/genomes/c.elegans/ensembl_99/STAR_INDEX/" #directory for genome index files
for sample in `cat $1`;
do
echo "Running STAR on sample $sample (paired-end reads) ..."
inputFile1="$dir$sample"_R1.fastq.gz
inputFile2="$dir$sample"_R2.fastq.gz
STAR --runThreadN $numProc \
--runMode alignReads \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat \
--genomeDir $genomeIndex \
--outFileNamePrefix "$sample"_ \
--readFilesIn $inputFile1 $inputFile2 \
--outReadsUnmapped Fastx \
--quantMode GeneCounts
done
echo "All done!"
STAR produces separate files with the unmapped reads (which we can use, for example, to map to circular RNA transcripts later) and it also produces gene counts. Ideally one should rely on transcript counts for differential gene expression; here we will use gene counts for the exploratory part of the analysis only.
I summarised the STAR output by running multiqc again. Results are under $PROJECT/star_mapping/multiqc/
I start by defining my own version of DESeq2's plotPCA() function so that I can define which PCs I will be plotting
#define my own pcaplot to allow other components besides PC1 and PC2 to be plotted
#the function itself is copied from DESEq2
myPlotPCA <- function (x, intgroup = "condition", pc1 = 1, pc2 =2, ntop = 500, returnData = FALSE)
{
rv <- genefilter::rowVars(assay(x)) #if using with SummarizedExperiment objects
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop,
length(rv)))]
pca <- prcomp(t(assay(x)[select, ]))
percentVar <- pca$sdev^2/sum(pca$sdev^2)
if (!all(intgroup %in% names(colData(x)))) {
stop("the argument 'intgroup' should specify columns of colData(dds)")
}
intgroup.df <- as.data.frame(colData(x)[, intgroup, drop = FALSE])
group <- if (length(intgroup) > 1) {
factor(apply(intgroup.df, 1, paste, collapse = " : "))
}
else {
colData(x)[[intgroup]]
}
d <- data.frame(PC1 = pca$x[, pc1], PC2 = pca$x[, pc2], group = group,
intgroup.df, names = colnames(x))
if (returnData) {
attr(d, "percentVar") <- percentVar[c(pc1,pc2)]
return(d)
}
ggplot(data = d,
aes_string(x = "PC1", y = "PC2", color = "group"))+
geom_point(size=3) +
xlab(paste0("PC",pc1,": ", round(percentVar[pc1] * 100), "% variance")) +
ylab(paste0("PC",pc2,": ", round(percentVar[pc2] * 100), "% variance"))
}
gtf <- rtracklayer::import('genomes/Caenorhabditis_elegans.WBcel235.99.gtf')
gtf_df=as.data.frame(gtf)
t2g <- data.frame(target_id = gtf_df$transcript_id,
ens_gene = gtf_df$gene_id,
ext_gene = gtf_df$gene_name)
t2g <- t2g %>% drop_na()
t2g <- dplyr::distinct(t2g)
The gene counts are in the .tab files of STAR. I moved those locally and use code from: http://biostars.org/p/241602 to read in counts so they can be used with DESeq2.
library(ggplot2)
library(DESeq2)
## Loading required package: SummarizedExperiment
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
##
## count
## Loading required package: BiocParallel
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply, rowsum
#First, read in the gene counts from STAR
number<- 4 #reverse-stranded library so counts should be in column 4
ff <- list.files( path = "./star_mapping", pattern = "*ReadsPerGene.out.tab$", full.names = TRUE )
counts.files <- lapply( ff, read.table, skip = 4 )
starcounts <- as.data.frame( sapply( counts.files, function(x) x[ , number ] ) )
ff <- gsub( "[_]ReadsPerGene[.]out[.]tab", "", ff )
ff <- gsub( "[.]/star_mapping/", "", ff )
colnames(starcounts) <- ff
row.names(starcounts) <- counts.files[[1]]$V1
rm(counts.files)
coldata <- read.table("metadata.txt", header=T)
rownames(coldata) <- coldata[,1]
coldata
## sample condition
## N1_S5 N1_S5 control
## N2_S6 N2_S6 control
## N3_S7 N3_S7 control
## N4_S8 N4_S8 control
## SG1_S1 SG1_S1 neurexin
## SG2_S2 SG2_S2 neurexin
## SG3_S3 SG3_S3 neurexin
## SG4_S4 SG4_S4 neurexin
#Check row and column names match in the same order
rownames(coldata) == colnames(starcounts)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
dds = DESeqDataSetFromMatrix(countData = starcounts,
colData = coldata,
design = ~ condition )
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet
## dim: 46904 8
## metadata(1): version
## assays(1): counts
## rownames(46904): WBGene00197333 WBGene00198386 ... WBGene00010967
## WBGene00014473
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#pre-filter to remove genes with too few counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds
## class: DESeqDataSet
## dim: 14163 8
## metadata(1): version
## assays(1): counts
## rownames(14163): WBGene00015153 WBGene00002061 ... WBGene00010966
## WBGene00010967
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#check quickly the distribution of the other counts
quantile(rowSums(counts(dds)))
## 0% 25% 50% 75% 100%
## 10 51 364 2168 83054802
#and in more detail
quantile(rowSums(counts(dds)), probs=seq(0.9,1,0.005))
## 90% 90.5% 91% 91.5% 92% 92.5%
## 5149.00 5294.22 5505.68 5756.92 6024.16 6250.85
## 93% 93.5% 94% 94.5% 95% 95.5%
## 6603.22 6954.88 7311.76 7756.54 8130.50 8767.13
## 96% 96.5% 97% 97.5% 98% 98.5%
## 9355.20 10231.64 11092.54 12633.10 14999.88 18921.49
## 99% 99.5% 100%
## 26026.46 38536.55 83054802.00
vsd <- vst(dds, blind=TRUE)
#note that by default only the top 500 genes in terms of variance are used for this PCA plot
plotPCA(vsd, intgroup=c("condition"))
This initial PCA looks promising with good separation of the samples. However, the rRNAs remain in the counts as shown above and they could be skewing the results.
We need to remove all counts corresponding to rRNA transcript_biotype reads (there are 23) - the genome/transcriptome files are from Ensembl and can be found in this directory: /d/in16/u/ubcg71a/research/genomes/c.elegans
zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | grep biotype:rRNA > list_of_biotype_rRNA_transcripts.txt
#keep gene names only
awk '{print $4}' list_of_biotype_rRNA_transcripts.txt | sed 's/gene://' | sed 's/\..//' > list_biotype_rRNA_genes.txt
We will now move this list of rRNA genes locally and use it to filter out the unwanted counts.
library("pheatmap")
#clean up
rm(ff, coldata, starcounts, dds, vsd)
number<- 4 #reverse-stranded library so counts should be in column 4
ff <- list.files( path = "./star_mapping", pattern = "*ReadsPerGene.out.tab$", full.names = TRUE )
counts.files <- lapply( ff, read.table, skip = 4 )
starcounts <- as.data.frame( sapply( counts.files, function(x) x[ , number ] ) )
ff <- gsub( "[_]ReadsPerGene[.]out[.]tab", "", ff )
ff <- gsub( "[.]/star_mapping/", "", ff )
colnames(starcounts) <- ff
row.names(starcounts) <- counts.files[[1]]$V1
rm(counts.files)
coldata <- read.table("metadata.txt", header=T)
rownames(coldata) <- coldata[,1]
coldata
## sample condition
## N1_S5 N1_S5 control
## N2_S6 N2_S6 control
## N3_S7 N3_S7 control
## N4_S8 N4_S8 control
## SG1_S1 SG1_S1 neurexin
## SG2_S2 SG2_S2 neurexin
## SG3_S3 SG3_S3 neurexin
## SG4_S4 SG4_S4 neurexin
#Check row and column names match in the same order
rownames(coldata) == colnames(starcounts)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#now remove the rRNAs from the counts table
rRNAgenes <- read.table(file="list_biotype_rRNA_genes.txt", header=F)
rRNAgenes
## V1
## 1 WBGene00004677
## 2 WBGene00004512
## 3 WBGene00004513
## 4 WBGene00004567
## 5 WBGene00004622
## 6 WBGene00014454
## 7 WBGene00014472
## 8 WBGene00014621
## 9 WBGene00077465
## 10 WBGene00077466
## 11 WBGene00077467
## 12 WBGene00077468
## 13 WBGene00077469
## 14 WBGene00077470
## 15 WBGene00077471
## 16 WBGene00077472
## 17 WBGene00077473
## 18 WBGene00077474
## 19 WBGene00077475
## 20 WBGene00077476
## 21 WBGene00077477
## 22 WBGene00189966
## 23 WBGene00235197
starcounts.norRNA <- starcounts[!row.names(starcounts)%in%rRNAgenes[,1],]
dim(starcounts)
## [1] 46904 8
dim(starcounts.norRNA)
## [1] 46881 8
dim(rRNAgenes)
## [1] 23 1
rm(starcounts) #clean up
dds = DESeqDataSetFromMatrix(countData = starcounts.norRNA,
colData = coldata,
design = ~ condition )
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet
## dim: 46881 8
## metadata(1): version
## assays(1): counts
## rownames(46881): WBGene00197333 WBGene00198386 ... WBGene00010967
## WBGene00014473
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#pre-filter to remove genes with too few counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds
## class: DESeqDataSet
## dim: 14156 8
## metadata(1): version
## assays(1): counts
## rownames(14156): WBGene00015153 WBGene00002061 ... WBGene00010966
## WBGene00010967
## rowData names(0):
## colnames(8): N1_S5 N2_S6 ... SG3_S3 SG4_S4
## colData names(2): sample condition
#check quickly the distribution of the other counts
quantile(rowSums(counts(dds)))
## 0% 25% 50% 75% 100%
## 10.00 51.00 363.00 2162.25 240054.00
#and in more detail
quantile(rowSums(counts(dds)), probs=seq(0.9,1,0.005))
## 90% 90.5% 91% 91.5% 92% 92.5% 93%
## 5137.500 5279.275 5485.100 5740.000 6012.600 6230.750 6569.650
## 93.5% 94% 94.5% 95% 95.5% 96% 96.5%
## 6916.775 7291.800 7736.800 8108.000 8734.175 9302.200 10166.250
## 97% 97.5% 98% 98.5% 99% 99.5% 100%
## 11065.100 12575.500 14494.600 18757.400 25663.900 37461.975 240054.000
#show the genes that still have very large counts
i<- rowSums(counts(dds)) >= 100000
assay(dds)[i,]
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4
## WBGene00000915 13312 12611 10603 15602 12740 10132 12254 14820
## WBGene00006929 18386 19388 18132 20877 18804 15398 18484 23763
## WBGene00006926 27452 26867 26905 28783 20753 17046 19889 25582
## WBGene00006930 29295 29066 28874 31287 30451 24642 29171 37268
## WBGene00001167 11896 10835 10303 14872 18855 12461 18973 21002
## WBGene00001168 21195 20607 20916 26436 23787 17016 25938 33677
#Check correlation between raw gene expression vectors for all samples (without log transformation)
pairs(assay(dds))
#and following log transformation
pairs(log(assay(dds)+0.5)) #add 0.5 as pseudocount
cor.mat <- cor(assay(dds))
cor.mat
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3
## N1_S5 1.0000000 0.9876924 0.9951246 0.9802646 0.9524706 0.9692726 0.9512352
## N2_S6 0.9876924 1.0000000 0.9900583 0.9869876 0.9472874 0.9698440 0.9392737
## N3_S7 0.9951246 0.9900583 1.0000000 0.9796024 0.9443960 0.9618507 0.9429706
## N4_S8 0.9802646 0.9869876 0.9796024 1.0000000 0.9705380 0.9718217 0.9649730
## SG1_S1 0.9524706 0.9472874 0.9443960 0.9705380 1.0000000 0.9881489 0.9968161
## SG2_S2 0.9692726 0.9698440 0.9618507 0.9718217 0.9881489 1.0000000 0.9817127
## SG3_S3 0.9512352 0.9392737 0.9429706 0.9649730 0.9968161 0.9817127 1.0000000
## SG4_S4 0.9579343 0.9485628 0.9532682 0.9725760 0.9954057 0.9829288 0.9972798
## SG4_S4
## N1_S5 0.9579343
## N2_S6 0.9485628
## N3_S7 0.9532682
## N4_S8 0.9725760
## SG1_S1 0.9954057
## SG2_S2 0.9829288
## SG3_S3 0.9972798
## SG4_S4 1.0000000
pheatmap(cor.mat, legend=T)
#This all looks good - Ns correlate better with each other than with SGs and vice versa.
#Transform the counts
vsd <- vst(dds, blind=TRUE)
plotPCA(vsd, intgroup=c("condition"))
#As expected the PCA is not changed much because we only removed a few genes
#try rlog
rld <- rlog(dds, blind = TRUE)
plotPCA(rld, intgroup=c("condition"))
#we can improve the plot a bit by highlighting the replicate number in different shape
pcaData <- plotPCA(vsd, intgroup=c("condition", "sample"), returnData=TRUE)
pcaData$replicate <- sub(pattern= ".*(.)_.*",
replacement="\\1",
perl=TRUE, fixed=FALSE,
x=pcaData$sample)
pcaData
## PC1 PC2 group condition sample name replicate
## N1_S5 -10.515292 3.9809806 control:N1_S5 control N1_S5 N1_S5 1
## N2_S6 -11.343047 -4.3937220 control:N2_S6 control N2_S6 N2_S6 2
## N3_S7 -10.735343 3.8215511 control:N3_S7 control N3_S7 N3_S7 3
## N4_S8 -8.883619 -3.2672766 control:N4_S8 control N4_S8 N4_S8 4
## SG1_S1 10.795760 -0.7681986 neurexin:SG1_S1 neurexin SG1_S1 SG1_S1 1
## SG2_S2 8.886207 -2.7062952 neurexin:SG2_S2 neurexin SG2_S2 SG2_S2 2
## SG3_S3 11.160175 1.7500821 neurexin:SG3_S3 neurexin SG3_S3 SG3_S3 3
## SG4_S4 10.635157 1.5828787 neurexin:SG4_S4 neurexin SG4_S4 SG4_S4 4
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=replicate)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
ggtitle(label="STAR counts, transformed with vsd")
pcaData <- myPlotPCA(vsd, pc1= 2, pc2= 3, intgroup=c("condition", "sample"), returnData=TRUE)
pcaData$replicate <- sub(pattern= ".*(.)_.*",
replacement="\\1",
perl=TRUE, fixed=FALSE,
x=pcaData$sample)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=replicate)) +
geom_point(size=3) +
xlab(paste0("PC2: ",percentVar[1],"% variance")) +
ylab(paste0("PC3: ",percentVar[2],"% variance")) +
coord_fixed()
#Check correlation between transformed gene expression vectors for all samples
pairs(assay(vsd))
cor.mat <- cor(assay(vsd))
cor.mat
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3
## N1_S5 1.0000000 0.9897970 0.9926442 0.9895144 0.9778167 0.9812096 0.9779625
## N2_S6 0.9897970 1.0000000 0.9904951 0.9911871 0.9752235 0.9819031 0.9726985
## N3_S7 0.9926442 0.9904951 1.0000000 0.9884828 0.9758348 0.9803706 0.9760524
## N4_S8 0.9895144 0.9911871 0.9884828 1.0000000 0.9836709 0.9840315 0.9817861
## SG1_S1 0.9778167 0.9752235 0.9758348 0.9836709 1.0000000 0.9919673 0.9944142
## SG2_S2 0.9812096 0.9819031 0.9803706 0.9840315 0.9919673 1.0000000 0.9898898
## SG3_S3 0.9779625 0.9726985 0.9760524 0.9817861 0.9944142 0.9898898 1.0000000
## SG4_S4 0.9789728 0.9741121 0.9774286 0.9830008 0.9947647 0.9907127 0.9951929
## SG4_S4
## N1_S5 0.9789728
## N2_S6 0.9741121
## N3_S7 0.9774286
## N4_S8 0.9830008
## SG1_S1 0.9947647
## SG2_S2 0.9907127
## SG3_S3 0.9951929
## SG4_S4 1.0000000
pheatmap(cor.mat, legend=T)
Next, we carry out differential gene expression based on the gene counts from STAR mapping. Note that this approach is now pretty much deprecated and replaced by transcript-based analysis summarised at the gene level so we won't really analyse the results here but add them here for completion.
#BiocManager::install("apeglm")
library(apeglm) #to use in the lfcShrink function
dds <- DESeq(dds) #estimates size factors and dispersions and fits the model
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
res
## log2 fold change (MLE): condition neurexin vs control
## Wald test p-value: condition neurexin vs control
## DataFrame with 14156 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## WBGene00015153 1.40026331133511 1.23369435011502 1.25556206618441
## WBGene00002061 1023.39358270591 -0.0561810129305647 0.111894106778005
## WBGene00001177 236.878096958743 0.645083433822109 0.130496798969703
## WBGene00021412 24.8706101313319 0.267994653237935 0.287508181787133
## WBGene00018976 6.03204079332923 0.0830540521723233 0.559216117300581
## ... ... ... ...
## WBGene00010963 646.235482854105 0.0560344400948453 0.532154321210568
## WBGene00010964 6610.73359986534 -0.301756164128778 0.339749774394959
## WBGene00010965 698.469898642392 -0.0714973118809531 0.592928262802426
## WBGene00010966 54.8204097454485 0.488445620632377 0.207379835476911
## WBGene00010967 2519.35135960664 0.508179713782424 0.122171381252483
## stat pvalue padj
## <numeric> <numeric> <numeric>
## WBGene00015153 0.98258332530239 0.325812554464328 NA
## WBGene00002061 -0.502090901373623 0.615603580096765 0.755603864950982
## WBGene00001177 4.94328932904997 7.68153514674162e-07 1.49941766583249e-05
## WBGene00021412 0.93212878872558 0.35126997168092 0.521797355147636
## WBGene00018976 0.148518702524594 0.881933427153093 0.933540011973405
## ... ... ... ...
## WBGene00010963 0.105297350526771 0.916139865263089 0.954440625434673
## WBGene00010964 -0.888171786621956 0.374448352556327 0.544673615403436
## WBGene00010965 -0.120583410112763 0.904021009760152 0.946326013043526
## WBGene00010966 2.35531877778328 0.0185068216532182 0.0590552767545967
## WBGene00010967 4.1595642823437 3.18855275260321e-05 0.000347870583450949
#next we want to shrink the LFC to help visualisation and ranking of genes
resultsNames(dds)
## [1] "Intercept" "condition_neurexin_vs_control"
resLFC <- lfcShrink(dds, coef="condition_neurexin_vs_control", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resLFC
## log2 fold change (MAP): condition neurexin vs control
## Wald test p-value: condition neurexin vs control
## DataFrame with 14156 rows and 5 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## WBGene00015153 1.40026331133511 0.0865693105221461 0.334873921280774
## WBGene00002061 1023.39358270591 -0.0501595379594478 0.106187975503463
## WBGene00001177 236.878096958743 0.6101360636599 0.131419234864905
## WBGene00021412 24.8706101313319 0.16047086366021 0.231989548013496
## WBGene00018976 6.03204079332923 0.0219340702385293 0.285509651993327
## ... ... ... ...
## WBGene00010963 646.235482854105 0.0165071893007835 0.281593365688285
## WBGene00010964 6610.73359986534 -0.15473391499059 0.25612671632153
## WBGene00010965 698.469898642392 -0.0161784771837534 0.289572855203541
## WBGene00010966 54.8204097454485 0.398418895186059 0.203231809685816
## WBGene00010967 2519.35135960664 0.476542635901622 0.122242317594528
## pvalue padj
## <numeric> <numeric>
## WBGene00015153 0.325812554464328 NA
## WBGene00002061 0.615603580096765 0.755603864950982
## WBGene00001177 7.68153514674162e-07 1.49941766583249e-05
## WBGene00021412 0.35126997168092 0.521797355147636
## WBGene00018976 0.881933427153093 0.933540011973405
## ... ... ...
## WBGene00010963 0.916139865263089 0.954440625434673
## WBGene00010964 0.374448352556327 0.544673615403436
## WBGene00010965 0.904021009760152 0.946326013043526
## WBGene00010966 0.0185068216532182 0.0590552767545967
## WBGene00010967 3.18855275260321e-05 0.000347870583450949
#MA plot
plotMA(resLFC, ylim=c(-2,2))
#contrast with the same plot but for the unshrunken log fold changes
plotMA(res, ylim=c(-2,2))
#plot top few genes
resLFCOrdered <- resLFC[order(resLFC$pvalue),]
resLFCOrdered[1:10,]
## log2 fold change (MAP): condition neurexin vs control
## Wald test p-value: condition neurexin vs control
## DataFrame with 10 rows and 5 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## WBGene00008862 222.57019613529 5.57679743055826 0.289049887335795
## WBGene00006541 816.161213344128 -1.93485791500237 0.102347519609402
## WBGene00016627 533.260899008354 -2.19269946224634 0.118338343000403
## WBGene00000657 369.382836201042 2.89687520175153 0.167550806546769
## WBGene00000712 519.893685556946 2.72103817505864 0.171840363835382
## WBGene00020218 263.048181538241 -2.09074236696147 0.132190159983885
## WBGene00016453 332.089024007193 -1.9220886364233 0.124163319245512
## WBGene00000703 450.872355408933 2.65300765292485 0.174067247594477
## WBGene00018393 711.229164789001 -1.68439569805028 0.117303671006414
## WBGene00022644 337.023507140299 1.58598181801307 0.115549105174495
## pvalue padj
## <numeric> <numeric>
## WBGene00008862 3.35503734507776e-83 4.47293578845766e-79
## WBGene00006541 6.37282902848922e-81 4.24812783039092e-77
## WBGene00016627 6.63788957158305e-78 2.94987812561151e-74
## WBGene00000657 2.60644403051886e-68 8.68727795371936e-65
## WBGene00000712 7.67188409824923e-58 2.04563117595718e-54
## WBGene00020218 1.40135503067076e-57 3.11381087815043e-54
## WBGene00016453 2.85659809132675e-55 5.44059510765261e-52
## WBGene00000703 7.94087824248563e-54 1.32334735911023e-50
## WBGene00018393 6.21843170000284e-48 9.21157015827087e-45
## WBGene00022644 5.18169923397144e-44 6.90824141873072e-41
rm(resLFC)
for (i in 1:10) {
plotCounts(dds, gene=rownames(resLFCOrdered)[i], intgroup="condition")
}
#devtools::install_github("rstudio/d3heatmap") #doesn't install otherwise
#BiocManager::install("pcaExplorer")
library(pcaExplorer)
#need a mapping from Ensembl ids to gene names; will use a t2g subset; row names must map gene ids
t2g.sub <- t2g[,c("ens_gene", "ext_gene")]
t2g.sub<- dplyr::distinct(t2g.sub)
rownames(t2g.sub) <- t2g.sub[,"ens_gene"]
colnames(t2g.sub) <- c("ens_gene", "gene_name")
#pcaExplorer(dds = dds, dst = vsd, annotation= t2g.sub)
#Conclusion: Good for quick exploration but the output downloaded plot for loadings did not agree with the one shown online so I'm not sure how buggy it is....(on another trial it worked...)
#Will explore loadings outside this...
Alt text
#BiocManager::install('PCAtools')
library(PCAtools)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## Loading required package: lattice
## Loading required package: cowplot
##
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
## default ggplot2 theme anymore. To recover the previous
## behavior, execute:
## theme_set(theme_cowplot())
## ********************************************************
##
## Attaching package: 'PCAtools'
## The following objects are masked from 'package:stats':
##
## biplot, screeplot
p<- pca(assay(vsd), metadata=coldata)
screeplot(p)
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
biplot(p, colby="condition")
pairsplot(p, colby= "condition")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
plotloadings(p)
## -- variables retained:
## WBGene00008862, WBGene00020218, WBGene00006541, WBGene00016627, WBGene00021468, WBGene00000716, WBGene00000656, WBGene00195017, WBGene00195016, WBGene00010314, WBGene00271656, WBGene00010965, WBGene00045168, WBGene00077652, WBGene00016534, WBGene00013699, WBGene00016457, WBGene00010215, WBGene00001789, WBGene00044922, WBGene00017661, WBGene00009523, WBGene00011345, WBGene00011974, WBGene00022240, WBGene00303364, WBGene00009129
loadings(p)[which.max(abs(loadings(p)$PC1)),]
## PC1 PC2 PC3 PC4 PC5
## WBGene00008862 0.09067169 -0.008180751 0.03666657 0.003266629 -0.03149979
## PC6 PC7 PC8
## WBGene00008862 -0.000739828 0.0002908326 0.002464358
loadings(p)[which.max(abs(loadings(p)$PC2)),]
## PC1 PC2 PC3 PC4 PC5
## WBGene00010965 -0.0002111257 -0.07537175 -0.00877533 -0.006426331 -0.0006274245
## PC6 PC7 PC8
## WBGene00010965 0.004823226 -0.005803499 -0.009104016
Observations from PCA loadings plot:
Aine's work showed a huge number of reads mapping to rRNA. I've decided it's probably best to remove them BEFORE proceeding with anything else so we don't have to worry about them later. I saw that many people use bwa or bowtie2 for this but rRNAs of eukaryotes can also have introns so would need a spliced aligner. Other solutions; BBsplit from BBmap can split reads from different references but does it based on k-mers. SortMeRNA also removes reads that map to a set of sequences (used mostly with prokaryotes and metagenomes..would need to check if it works well with eukaryotes).
I decided in the end that the easiest solution might be to rebuild the index that kallisto uses and re-run kallisto so that the rRNAs are not part of the reference transcriptome, allowing the reads to remain unmapped (hopefully!).
We need to remove all sequences with rRNA transcript_biotype reads (there are 23) - the genome/transcriptome files are from Ensembl and can be found in this directory: /d/in16/u/ubcg71a/research/genomes/c.elegans
zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | grep biotype:rRNA > list_of_biotype_rRNA_transcripts.txt
Found a one-liner online to do this easily (basically, stitches the lines together separated by "", greps and then separates again with newlines:
zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' | grep -vFf pattern.txt - | tr "\t" "\n" > Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa
Checking:
grep ">" Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa | wc
61428
zcat Caenorhabditis_elegans.WBcel235.cdna_and_ncrna.fa.gz | grep ">" | wc
61451
wc list_of_biotype_rRNA_transcripts.txt
23
Then zip the final .fa file:
gzip -9 Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa
(all transcriptome files are under: /d/in16/u/ubcg71a/research/genomes/c.elegans See README for how they were obtained):
/d/in7/s/kallisto/kallisto_linux-v0.46.1/kallisto index -i Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.index Caenorhabditis_elegans.WBcel235.cdna_and_ncrna_norRNA.fa.gz
This blog: http://genomespot.blogspot.com/2015/08/how-accurate-is-kallisto.html suggests that the difference between trimming before kallisto or not trimming does not make a huge difference overall.
So I did not run Trimmomatic on top of fastp as I would have done if we were mapping with STAR or similar.
Running first without pseudoalignment (which takes a lot more time and space) but with bootstrapping (this is run in the directory $PROJECT/kallisto_with_ncRNA_norRNA/):
nohup ./run_kallisto.sh ../fastp_merged/ N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4 >& run_kallisto.out &
I got initially rather low mapping rates (about 20%) and was worried I had the orientation of the reads wrong. I reversed the orientation and that gave less than 1% alignment so that was clearly not the problem. In this forum: https://github.com/pachterlab/kallisto/issues/198 there is a suggestion that the culprit is ncRNAs that are not included in the transcriptome file, which would make sense since we know that 80% of the reads are to rRNA in this case.
I re-did the run with a transcriptome that had ncRNA too. Now, alignment jumped to 98.2% so the ncRNA missing was definitely the problem here (rRNA really..) These results were under kallisto_with_ncRNA, a directory that I eventually deleted to save space. The results without ncRNA are now renamed to kallisto_cDNA NOTE: the abundance estimates for the bootstrap are not written by default to the csv file but they can be written out using:
/d/in7/s/kallisto/kallisto_linux-v0.46.1/kallisto h5dump -o output abundance.h5
(where output is a directory where the results will be saved)
I will probably keep both versions and run sleuth with both to see whether there are any differences that affect the coding genes too.
Finally, I re-did the run with a transcriptome that had ncRNA but no rRNA. I then deleted the previous run that had ncRNA, including rRNAs to save space. So the kallisto_with_ncRNA directory has now been replaced with: kallisto_with_ncRNA_norRNA Alignment rates: p_unique: 9-10%, p_pseudoaligned: ~20%
I also did one re-run with ncRNA and no bootstrap but asking for pseudo-alignment to the genome so that we could get pseudo-bam files for visualisation: In directory: /d/in16/u/ubcg71a/research/filipe/kallisto_with_ncRNA_norRNA/pseudo_bam/
nohup ./run_kallisto_pseudobam.sh ../../fastp_merged/ N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4 >& run_kallisto.out &
This approach is following the code from: https://github.com/pachterlab/aggregationDE/blob/master/R/gene_pipeline and the procedure of aggregating p-values after transcript differential expression has been carried out to call differentially expressed genes (see paper by Yi et al: doi: 10.1186/s13059-018-1419-z).
'%!in%' <- function(x,y)!('%in%'(x,y))
setwd("~/Dropbox/work/research/Filipe_neurexin/NEW_PROCESS/")
#sample_id <- dir(file.path(".", "kallisto_cdna"))
sample_id <- dir(file.path(".", "kallisto_with_ncRNA_norRNA"))
sample_id
## [1] "N1_S5" "N2_S6" "N3_S7" "N4_S8" "SG1_S1" "SG2_S2" "SG3_S3" "SG4_S4"
#kal_dirs <- file.path(".", "kallisto_cdna", sample_id)
kal_dirs <- file.path(".", "kallisto_with_ncRNA_norRNA", sample_id)
Sleuth requires metadata information corresponding to the samples in the dataset. In this case, the metadata is very simple and was already created outside the R script. We simply need to add to this table the directories where kallisto results are saved.
s2c <- read.table("./metadata.txt", header = TRUE, stringsAsFactors=FALSE)
s2c
## sample condition
## 1 N1_S5 control
## 2 N2_S6 control
## 3 N3_S7 control
## 4 N4_S8 control
## 5 SG1_S1 neurexin
## 6 SG2_S2 neurexin
## 7 SG3_S3 neurexin
## 8 SG4_S4 neurexin
s2c <- dplyr::mutate(s2c, path = kal_dirs)
s2c
## sample condition path
## 1 N1_S5 control ./kallisto_with_ncRNA_norRNA/N1_S5
## 2 N2_S6 control ./kallisto_with_ncRNA_norRNA/N2_S6
## 3 N3_S7 control ./kallisto_with_ncRNA_norRNA/N3_S7
## 4 N4_S8 control ./kallisto_with_ncRNA_norRNA/N4_S8
## 5 SG1_S1 neurexin ./kallisto_with_ncRNA_norRNA/SG1_S1
## 6 SG2_S2 neurexin ./kallisto_with_ncRNA_norRNA/SG2_S2
## 7 SG3_S3 neurexin ./kallisto_with_ncRNA_norRNA/SG3_S3
## 8 SG4_S4 neurexin ./kallisto_with_ncRNA_norRNA/SG4_S4
We will get gene names from biomaRt to add to the transcript IDs we have from sleuth
ens <- useMart("ensembl")
attr = c("ensembl_gene_id",
"ensembl_transcript_id",
"description",
"external_gene_name",
"entrezgene_id")
celeg <- useMart("ensembl",
dataset = "celegans_gene_ensembl")
t2g <- getBM(attributes = attr,
mart = celeg)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Cache found
t2g <- dplyr::rename(t2g,
target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id,
ext_gene = external_gene_name)
We first carry out differential expression with sleuth, summarised at the gene level (using p-value aggregation).
#### GENE-LEVEL-ANALYSIS ####
#Following code from: https://github.com/pachterlab/aggregationDE/blob/master/R/gene_pipeline.R
design <- ~ condition
so <- sleuth_prep(s2c,
full_model = design,
read_bootstrap_tpm = TRUE,
extra_bootstrap_summary=TRUE,
target_mapping = t2g,
transform_fun_counts = function(x) log2(x + 0.5), #change the natural log fold change to log2 fold change
aggregation_column = 'ens_gene'
#filter_target_id=txfilter
)
## Warning in check_num_cores(num_cores): It appears that you are running Sleuth from within Rstudio.
## Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1.
## If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
## reading in kallisto results
## dropping unused factor levels
## ........
## normalizing est_counts
## 15845 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
## ........
so <- sleuth_fit(so, ~condition, 'full')
## fitting measurement error models
## shrinkage estimation
## 4 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
## The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
## These are the target ids with NA values: T23E7.4a.2, C04F6.1.1, C42D8.2a.1, K07H8.6a.1
## computing variance of betas
#do a Wald test first
so <- sleuth_wt(so, which_beta='conditionneurexin', which_model='full')
sleuth_table_wt <- sleuth_results(so,
test= 'conditionneurexin',
test_type = 'wt',
which_model = 'full')
dim(sleuth_table_wt)
## [1] 46904 8
sleuth_table_wt[1:5,]
## target_id description
## 1 WBGene00016627
## 2 WBGene00008862
## 3 WBGene00020218
## 4 WBGene00016453 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5 WBGene00000657 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135]
## ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1 C44B7.5 174124 1 8.681924
## 2 F15D4.5 174969 1 5.984510
## 3 T04G9.7 180424 1 7.699620
## 4 vet-2 172993 1 8.088581
## 5 col-81 174686 1 7.883827
## pval qval
## 1 7.057950e-85 8.001598e-81
## 2 1.613360e-81 9.145333e-78
## 3 1.576815e-60 5.958785e-57
## 4 1.408001e-49 3.990628e-46
## 5 3.739522e-46 8.478992e-43
sleuth_significant_wt <- dplyr::filter(sleuth_table_wt, qval <= 0.05)
dim(sleuth_significant_wt)
## [1] 2328 8
head(sleuth_significant_wt, 20)
## target_id
## 1 WBGene00016627
## 2 WBGene00008862
## 3 WBGene00020218
## 4 WBGene00016453
## 5 WBGene00000657
## 6 WBGene00022644
## 7 WBGene00012557
## 8 WBGene00003977
## 9 WBGene00010158
## 10 WBGene00010808
## 11 WBGene00007196
## 12 WBGene00018605
## 13 WBGene00017648
## 14 WBGene00000301
## 15 WBGene00018138
## 16 WBGene00009897
## 17 WBGene00018393
## 18 WBGene00021005
## 19 WBGene00011432
## 20 WBGene00000703
## description
## 1
## 2
## 3
## 4 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 6 Downstream Of DAF-16 (Regulated by DAF-16) [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 7
## 8 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 9 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 10
## 11
## 12
## 13 D-aspartate oxidase 3 [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 14 Caveolin-1 [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 15 Folate-like transporter 2 [Source:UniProtKB/Swiss-Prot;Acc:O45166]
## 16 Peroxidase skpo-1 [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 17 Methionine Sulfoxide Reductase A [Source:UniProtKB/TrEMBL;Acc:O02089]
## 18 Uterine Lumin Expressed/locailized [Source:UniProtKB/TrEMBL;Acc:X5M5P2]
## 19 SKN-1 Dependent Zygotic transcript [Source:UniProtKB/TrEMBL;Acc:O02297]
## 20 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q21556]
## ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1 C44B7.5 174124 1 8.681924
## 2 F15D4.5 174969 1 5.984510
## 3 T04G9.7 180424 1 7.699620
## 4 vet-2 172993 1 8.088581
## 5 col-81 174686 1 7.883827
## 6 dod-19 178564 1 8.195626
## 7 Y37D8A.19 176711 1 9.265287
## 8 pes-2.1 173025 1 6.538641
## 9 pes-2.2 173026 1 6.538641
## 10 sepa-1 173196 1 6.528977
## 11 B0513.4 178351 1 7.571620
## 12 F48E3.4 181006 1 8.878642
## 13 ddo-3 184746 1 8.419052
## 14 cav-1 177815 1 7.539687
## 15 folt-2 178745 1 10.118191
## 16 skpo-1 174340 1 9.551359
## 17 msra-1 185709 2 16.353021
## 18 ule-1 171778 1 8.688486
## 19 sdz-30 173198 1 5.798016
## 20 col-129 178155 1 8.242931
## pval qval
## 1 7.057950e-85 8.001598e-81
## 2 1.613360e-81 9.145333e-78
## 3 1.576815e-60 5.958785e-57
## 4 1.408001e-49 3.990628e-46
## 5 3.739522e-46 8.478992e-43
## 6 6.131492e-41 1.158545e-37
## 7 3.821684e-39 6.189491e-36
## 8 4.000704e-38 5.039554e-35
## 9 4.000704e-38 5.039554e-35
## 10 8.901306e-38 1.009141e-34
## 11 1.780045e-37 1.834579e-34
## 12 1.672903e-36 1.580475e-33
## 13 6.915372e-36 6.030737e-33
## 14 1.116744e-33 9.043235e-31
## 15 3.081094e-32 2.328690e-29
## 16 5.041138e-31 3.571961e-28
## 17 4.419597e-30 2.947351e-27
## 18 2.273631e-29 1.432008e-26
## 19 2.532942e-29 1.511367e-26
## 20 5.864490e-29 3.324286e-26
#and then a LRT test
so <- sleuth_fit(so, ~1, 'null')
## fitting measurement error models
## shrinkage estimation
## 4 NA values were found during variance shrinkage estimation due to mean observation values outside of the range used for the LOESS fit.
## The LOESS fit will be repeated using exact computation of the fitted surface to extrapolate the missing values.
## These are the target ids with NA values: T23E7.4a.2, C04F6.1.1, C42D8.2a.1, K07H8.6a.1
## computing variance of betas
so <- sleuth_lrt(so, 'null', 'full')
sleuth_table_lrt <- sleuth_results(so, test='null:full', 'lrt')
sleuth_table_lrt[1:5,]
## target_id
## 1 WBGene00006436
## 2 WBGene00008862
## 3 WBGene00016627
## 4 WBGene00018393
## 5 WBGene00020218
## description
## 1 Titin homolog [Source:UniProtKB/Swiss-Prot;Acc:G4SLH0]
## 2
## 3
## 4 Methionine Sulfoxide Reductase A [Source:UniProtKB/TrEMBL;Acc:O02089]
## 5
## ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1 ttn-1 266969 15 74.087167
## 2 F15D4.5 174969 1 5.984510
## 3 C44B7.5 174124 1 8.681924
## 4 msra-1 185709 2 16.353021
## 5 T04G9.7 180424 1 7.699620
## pval qval
## 1 5.755169e-09 1.631159e-05
## 2 3.530000e-09 1.631159e-05
## 3 5.541369e-09 1.631159e-05
## 4 2.585526e-09 1.631159e-05
## 5 8.194098e-09 1.857930e-05
sleuth_significant_lrt <- dplyr::filter(sleuth_table_lrt, qval <= 0.05)
dim(sleuth_significant_lrt)
## [1] 2251 8
head(sleuth_significant_lrt, 20)
## target_id
## 1 WBGene00006436
## 2 WBGene00008862
## 3 WBGene00016627
## 4 WBGene00018393
## 5 WBGene00020218
## 6 WBGene00016453
## 7 WBGene00013567
## 8 WBGene00022644
## 9 WBGene00009007
## 10 WBGene00009897
## 11 WBGene00018602
## 12 WBGene00000657
## 13 WBGene00007196
## 14 WBGene00017648
## 15 WBGene00010808
## 16 WBGene00000301
## 17 WBGene00003977
## 18 WBGene00010158
## 19 WBGene00010204
## 20 WBGene00010822
## description
## 1 Titin homolog [Source:UniProtKB/Swiss-Prot;Acc:G4SLH0]
## 2
## 3
## 4 Methionine Sulfoxide Reductase A [Source:UniProtKB/TrEMBL;Acc:O02089]
## 5
## 6 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 7
## 8 Downstream Of DAF-16 (Regulated by DAF-16) [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 9 OTUBain deubiquitylating protease homolog [Source:UniProtKB/TrEMBL;Acc:Q19681]
## 10 Peroxidase skpo-1 [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 11
## 12 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 13
## 14 D-aspartate oxidase 3 [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 15
## 16 Caveolin-1 [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 17 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 18 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 19
## 20
## ext_gene entrezgene_id num_aggregated_transcripts sum_mean_obs_counts
## 1 ttn-1 266969 15 74.087167
## 2 F15D4.5 174969 1 5.984510
## 3 C44B7.5 174124 1 8.681924
## 4 msra-1 185709 2 16.353021
## 5 T04G9.7 180424 1 7.699620
## 6 vet-2 172993 1 8.088581
## 7 Y75B12B.1 190716 3 23.328509
## 8 dod-19 178564 1 8.195626
## 9 otub-3 177679 3 23.840400
## 10 skpo-1 174340 1 9.551359
## 11 F48D6.4 180697 3 15.152133
## 12 col-81 174686 1 7.883827
## 13 B0513.4 178351 1 7.571620
## 14 ddo-3 184746 1 8.419052
## 15 sepa-1 173196 1 6.528977
## 16 cav-1 177815 1 7.539687
## 17 pes-2.1 173025 1 6.538641
## 18 pes-2.2 173026 1 6.538641
## 19 F57F5.1 179645 1 10.940901
## 20 M01G12.9 187385 1 5.192433
## pval qval
## 1 5.755169e-09 1.631159e-05
## 2 3.530000e-09 1.631159e-05
## 3 5.541369e-09 1.631159e-05
## 4 2.585526e-09 1.631159e-05
## 5 8.194098e-09 1.857930e-05
## 6 1.688221e-08 3.189893e-05
## 7 2.016994e-08 3.266666e-05
## 8 2.598034e-08 3.681739e-05
## 9 3.209199e-08 4.042521e-05
## 10 4.835407e-08 5.481901e-05
## 11 7.432027e-08 7.659717e-05
## 12 1.101088e-07 9.775006e-05
## 13 1.143796e-07 9.775006e-05
## 14 1.207110e-07 9.775006e-05
## 15 1.401189e-07 1.059018e-04
## 16 2.276656e-07 1.206595e-04
## 17 2.447886e-07 1.206595e-04
## 18 2.447886e-07 1.206595e-04
## 19 2.350286e-07 1.206595e-04
## 20 2.052485e-07 1.206595e-04
#Note: The Wald test usually returns more hits than the likelihood ratio test
#check counts for top significant genes in WT
selected_gene <- sleuth_significant_wt$target_id[1]
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene==selected_gene))$target_id[1], units = "est_counts", color_by = "condition")
selected_gene <- sleuth_significant_wt$target_id[2]
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene==selected_gene))$target_id[1], units = "est_counts", color_by = "condition")
selected_gene <- sleuth_significant_wt$target_id[3]
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene==selected_gene))$target_id[1], units = "est_counts", color_by = "condition")
#save the gene differential expression results so they can be used with web servers for further analysis
#check the table has no duplicate entries
length(unique(sleuth_table_wt$target_id)) == length(sleuth_table_wt$target_id)
## [1] TRUE
write.table(sleuth_significant_wt$target_id, file="results/sleuth_significant_q005_NEW.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")
#avoid using this - pointless
write.table(dplyr::filter(sleuth_table_wt, qval > 0.05)$target_id, file="results/bg_genes_good.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")
#use this instead
write.table(dplyr::filter(sleuth_table_wt, !is.na(qval))$target_id, file="results/bg_genes_all.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")
Next, we extract differential expression results from the same sleuth object but this time we do not aggregate p-values so our target IDs are transcript IDs.
##### TRANSCRIPT-LEVEL ANALYSIS #####
#To do the transcript-level analysis, simply set the pval_aggregate to FALSE
#NOTE that target_id will now be the transcript id rather than the gene id
#We will use the Wald test only from here on
sleuth_table_txn <- sleuth_results(so,
test = 'conditionneurexin',
test_type = 'wt',
which_model = 'full',
show_all = TRUE,
pval_aggregate = FALSE)
sleuth_table_txn[1:5,]
## ens_gene target_id
## 1 WBGene00016627 C44B7.5.1
## 2 WBGene00008862 F15D4.5.1
## 3 WBGene00020218 T04G9.7.1
## 4 WBGene00016453 C35E7.1a.1
## 5 WBGene00000657 F38A3.1.1
## description ext_gene
## 1 C44B7.5
## 2 F15D4.5
## 3 T04G9.7
## 4 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7] vet-2
## 5 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135] col-81
## entrezgene_id pval qval b se_b mean_obs var_obs
## 1 174124 7.057950e-85 1.118332e-80 -2.188855 0.1121192 8.681924 1.387181
## 2 174969 1.613360e-81 1.278185e-77 5.590020 0.2923135 5.984510 9.022435
## 3 180424 1.576815e-60 8.328212e-57 -2.091243 0.1274238 7.699620 1.263721
## 4 172993 1.408001e-49 5.577445e-46 -1.915964 0.1294338 8.088581 1.062373
## 5 174686 3.739522e-46 1.185054e-42 2.929165 0.2053731 7.883827 2.523736
## tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 0.006406936 0.014941130 0.01873447 0.01873447
## 2 0.123748994 -0.013684067 0.04714542 0.04714542
## 3 0.011712903 0.004862748 0.02076073 0.02076073
## 4 0.013814870 0.001980859 0.01969134 0.01969134
## 5 0.014387106 0.069969075 0.02012675 0.06996908
sleuth_significant_txn <- dplyr::filter(sleuth_table_txn, qval <= 0.05)
dim(sleuth_significant_txn)
## [1] 2263 15
head(sleuth_significant_txn, 20)
## ens_gene target_id
## 1 WBGene00016627 C44B7.5.1
## 2 WBGene00008862 F15D4.5.1
## 3 WBGene00020218 T04G9.7.1
## 4 WBGene00016453 C35E7.1a.1
## 5 WBGene00000657 F38A3.1.1
## 6 WBGene00022644 ZK6.10a.1
## 7 WBGene00012557 Y37D8A.19.1
## 8 WBGene00003977 F56G4.2.1
## 9 WBGene00010158 F56G4.3.1
## 10 WBGene00010808 M01E5.6.1
## 11 WBGene00007196 B0513.4a.1
## 12 WBGene00018605 F48E3.4.1
## 13 WBGene00017648 F20H11.5.1
## 14 WBGene00000301 T13F2.8a.1
## 15 WBGene00018138 F37B4.7.1
## 16 WBGene00009897 F49E12.1.1
## 17 WBGene00021005 W03F11.1a.1
## 18 WBGene00011432 T04D3.2.1
## 19 WBGene00000703 M18.1a.1
## 20 WBGene00010204 F57F5.1.1
## description
## 1
## 2
## 3
## 4 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 6 Downstream Of DAF-16 (Regulated by DAF-16) [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 7
## 8 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 9 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 10
## 11
## 12
## 13 D-aspartate oxidase 3 [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 14 Caveolin-1 [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 15 Folate-like transporter 2 [Source:UniProtKB/Swiss-Prot;Acc:O45166]
## 16 Peroxidase skpo-1 [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 17 Uterine Lumin Expressed/locailized [Source:UniProtKB/TrEMBL;Acc:X5M5P2]
## 18 SKN-1 Dependent Zygotic transcript [Source:UniProtKB/TrEMBL;Acc:O02297]
## 19 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q21556]
## 20
## ext_gene entrezgene_id pval qval b se_b
## 1 C44B7.5 174124 7.057950e-85 1.118332e-80 -2.188855 0.1121192
## 2 F15D4.5 174969 1.613360e-81 1.278185e-77 5.590020 0.2923135
## 3 T04G9.7 180424 1.576815e-60 8.328212e-57 -2.091243 0.1274238
## 4 vet-2 172993 1.408001e-49 5.577445e-46 -1.915964 0.1294338
## 5 col-81 174686 3.739522e-46 1.185054e-42 2.929165 0.2053731
## 6 dod-19 178564 6.131492e-41 1.619225e-37 1.624992 0.1212775
## 7 Y37D8A.19 176711 3.821684e-39 8.650655e-36 -1.418486 0.1083750
## 8 pes-2.1 173025 4.000704e-38 7.043462e-35 -2.491918 0.1930361
## 9 pes-2.2 173026 4.000704e-38 7.043462e-35 -2.491918 0.1930361
## 10 sepa-1 173196 8.901306e-38 1.410412e-34 -2.208029 0.1718666
## 11 B0513.4 178351 1.780045e-37 2.564074e-34 -1.673971 0.1308443
## 12 F48E3.4 181006 1.672903e-36 2.208929e-33 -1.422679 0.1127469
## 13 ddo-3 184746 6.915372e-36 8.428775e-33 -1.399326 0.1118917
## 14 cav-1 177815 1.116744e-33 1.263915e-30 -1.591271 0.1315598
## 15 folt-2 178745 3.081094e-32 3.254662e-29 -1.266567 0.1071556
## 16 skpo-1 174340 5.041138e-31 4.992302e-28 -1.228911 0.1060985
## 17 ule-1 171778 2.273631e-29 2.119158e-26 -1.231102 0.1094155
## 18 sdz-30 173198 2.532942e-29 2.229693e-26 -2.685465 0.2388758
## 19 col-129 178155 5.864490e-29 4.890676e-26 2.720450 0.2435987
## 20 F57F5.1 179645 2.536609e-28 2.009629e-25 -1.276974 0.1157005
## mean_obs var_obs tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 8.681924 1.3871805 0.006406936 0.0149411299 0.01873447 0.01873447
## 2 5.984510 9.0224349 0.123748994 -0.0136840674 0.04714542 0.04714542
## 3 7.699620 1.2637210 0.011712903 0.0048627478 0.02076073 0.02076073
## 4 8.088581 1.0623734 0.013814870 0.0019808591 0.01969134 0.01969134
## 5 7.883827 2.5237360 0.014387106 0.0699690755 0.02012675 0.06996908
## 6 8.195626 0.7631147 0.009918284 0.0001820417 0.01949818 0.01949818
## 7 9.265287 0.5950208 0.003512276 0.0199779905 0.01899745 0.01997799
## 8 6.538641 1.8380668 0.015789672 0.0587362241 0.03029372 0.05873622
## 9 6.538641 1.8380668 0.015789672 0.0587362241 0.03029372 0.05873622
## 10 6.528977 1.4333994 0.028600125 0.0185683623 0.03047611 0.03047611
## 11 7.571620 0.8217040 0.012797086 0.0117981168 0.02144336 0.02144336
## 12 8.878642 0.6000823 0.004552084 0.0208716297 0.01872224 0.02087163
## 13 8.419052 0.5739371 0.006008771 0.0108797729 0.01903073 0.01903073
## 14 7.539687 0.7471720 0.012961105 0.0146920667 0.02165486 0.02165486
## 15 10.118191 0.4730252 0.001910259 0.0152218968 0.02105439 0.02105439
## 16 9.551359 0.4343759 0.003086547 0.0002776849 0.01942725 0.01942725
## 17 8.688486 0.4531146 0.005211902 0.0182179264 0.01873159 0.01873159
## 18 5.798016 2.1486703 0.057929880 0.0449449658 0.05619342 0.05619342
## 19 8.242931 2.2162546 0.011000825 0.1076798091 0.01940853 0.10767981
## 20 10.940901 0.4787103 0.001103854 0.0138371405 0.02566937 0.02566937
write.table(sleuth_significant_txn$target_id, file="results/sleuth_significant_TXN_q005.txt", quote=FALSE, row.names = FALSE, col.names = "#target_id")
#the transcript table contains beta values too so one can talk about up and down regulated transcripts and genes
sleuth_significant_txn.up <- dplyr::filter(sleuth_table_txn, (qval <= 0.05) & (b>0))
head(sleuth_significant_txn.up, 20)
## ens_gene target_id
## 1 WBGene00008862 F15D4.5.1
## 2 WBGene00000657 F38A3.1.1
## 3 WBGene00022644 ZK6.10a.1
## 4 WBGene00000703 M18.1a.1
## 5 WBGene00017186 F07B7.2.1
## 6 WBGene00021107 W09B7.2.1
## 7 WBGene00000712 F41F3.4.1
## 8 WBGene00001399 F10D2.9.1
## 9 WBGene00017071 D2096.3.1
## 10 WBGene00000998 K07E12.1a.1
## 11 WBGene00019617 K10C2.1.1
## 12 WBGene00010605 K06G5.1a.3
## 13 WBGene00001758 Y45G12C.2a.1
## 14 WBGene00012445 Y16B4A.2.1
## 15 WBGene00010822 M01G12.9.1
## 16 WBGene00020128 R193.2.1
## 17 WBGene00012018 T25C12.3.1
## 18 WBGene00022497 Y119D3B.21.1
## 19 WBGene00003515 K12F2.1.1
## 20 WBGene00021050 W05H9.3
## description
## 1
## 2 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 3 Downstream Of DAF-16 (Regulated by DAF-16) [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 4 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q21556]
## 5
## 6
## 7 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20282]
## 8 Delta(9)-fatty-acid desaturase fat-7 [Source:UniProtKB/Swiss-Prot;Acc:G5EGH6]
## 9 Acid Alpha Glucosidase Relate [Source:UniProtKB/TrEMBL;Acc:Q19004]
## 10 Mesocentin [Source:UniProtKB/Swiss-Prot;Acc:Q09165]
## 11 Carboxypeptidase [Source:UniProtKB/TrEMBL;Acc:Q94269]
## 12
## 13 Glutathione S-transferase P 10 [Source:UniProtKB/Swiss-Prot;Acc:Q9N4X8]
## 14 Carboxypeptidase [Source:UniProtKB/TrEMBL;Acc:K8ESM2]
## 15
## 16
## 17
## 18
## 19 Myosin-3 [Source:UniProtKB/Swiss-Prot;Acc:P12844]
## 20
## ext_gene entrezgene_id pval qval b se_b
## 1 F15D4.5 174969 1.613360e-81 1.278185e-77 5.590020 0.2923135
## 2 col-81 174686 3.739522e-46 1.185054e-42 2.929165 0.2053731
## 3 dod-19 178564 6.131492e-41 1.619225e-37 1.624992 0.1212775
## 4 col-129 178155 5.864490e-29 4.890676e-26 2.720450 0.2435987
## 5 F07B7.2 179263 3.689346e-28 2.657167e-25 5.295657 0.4812847
## 6 W09B7.2 189307 3.689346e-28 2.657167e-25 5.295657 0.4812847
## 7 col-139 178857 5.933925e-28 4.087959e-25 2.802625 0.2557083
## 8 fat-7 179100 7.520251e-28 4.964932e-25 1.238670 0.1132366
## 9 aagr-1 177632 2.947247e-26 1.796120e-23 1.285199 0.1212335
## 10 dig-1 175951 5.555927e-24 2.839795e-21 1.448524 0.1434262
## 11 K10C2.1 180915 6.912340e-24 3.422688e-21 1.082287 0.1073911
## 12 K06G5.1 181531 2.726013e-23 1.308899e-20 1.176876 0.1183712
## 13 gst-10 178725 1.757955e-22 8.192588e-20 1.430538 0.1466484
## 14 Y16B4A.2 181572 1.881803e-22 8.519191e-20 1.107607 0.1136243
## 15 M01G12.9 187385 5.564544e-22 2.449172e-19 3.102246 0.3219007
## 16 R193.2 182256 1.774988e-21 7.601264e-19 1.266816 0.1331047
## 17 T25C12.3 259727 2.854599e-21 1.190293e-18 1.082411 0.1143240
## 18 Y119D3B.21 246006 2.328616e-20 8.999250e-18 1.129906 0.1222037
## 19 myo-3 179676 9.198664e-20 3.389601e-17 1.509785 0.1659464
## 20 W05H9.3 <NA> 1.125043e-19 3.961402e-17 1.383175 0.1523970
## mean_obs var_obs tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 5.984510 9.0224349 0.123748994 -0.0136840674 0.04714542 0.04714542
## 2 7.883827 2.5237360 0.014387106 0.0699690755 0.02012675 0.06996908
## 3 8.195626 0.7631147 0.009918284 0.0001820417 0.01949818 0.01949818
## 4 8.242931 2.2162546 0.011000825 0.1076798091 0.01940853 0.10767981
## 5 4.548737 8.4096536 0.182373435 0.2808964331 0.21121433 0.28089643
## 6 4.548737 8.4096536 0.182373435 0.2808964331 0.21121433 0.28089643
## 7 8.429522 2.3562939 0.008941758 0.1218317041 0.01901130 0.12183170
## 8 9.018604 0.4603540 0.005038598 0.0206064395 0.01878882 0.02060644
## 9 8.230056 0.4859733 0.009961643 0.0064280629 0.01943347 0.01943347
## 10 9.056253 0.6347564 0.021076051 0.0200660739 0.01881452 0.02006607
## 11 9.067225 0.3466997 0.004243176 0.0097917632 0.01882251 0.01882251
## 12 9.502552 0.4013749 0.008687104 -0.0020950765 0.01933636 0.01933636
## 13 7.269504 0.6215641 0.016870246 0.0261412419 0.02336898 0.02614124
## 14 8.448535 0.3643348 0.006843142 0.0092831966 0.01897781 0.01897781
## 15 5.192433 2.7723589 0.102884208 -0.0764415518 0.10435592 0.10435592
## 16 7.527236 0.4656110 0.013691320 -0.0054193025 0.02174240 0.02174240
## 17 10.267849 0.3571526 0.001730881 0.0244090848 0.02167540 0.02440908
## 18 9.682303 0.3903688 0.002851854 0.0270156523 0.01970800 0.02701565
## 19 8.599196 0.6984802 0.006945897 0.0481305326 0.01878800 0.04813053
## 20 8.060157 0.5864352 0.008808961 0.0376407567 0.01974351 0.03764076
#we will save the gene ids instead of transcript ones
write.table(unique(sleuth_significant_txn.up$ens_gene), file="results/sleuth_significant_q005_up.txt", quote=FALSE, row.names = FALSE, col.names = "#ens_gene")
sleuth_significant_txn.down <- dplyr::filter(sleuth_table_txn, (qval <= 0.05) & (b<0))
head(sleuth_significant_txn.down, 20)
## ens_gene target_id
## 1 WBGene00016627 C44B7.5.1
## 2 WBGene00020218 T04G9.7.1
## 3 WBGene00016453 C35E7.1a.1
## 4 WBGene00012557 Y37D8A.19.1
## 5 WBGene00003977 F56G4.2.1
## 6 WBGene00010158 F56G4.3.1
## 7 WBGene00010808 M01E5.6.1
## 8 WBGene00007196 B0513.4a.1
## 9 WBGene00018605 F48E3.4.1
## 10 WBGene00017648 F20H11.5.1
## 11 WBGene00000301 T13F2.8a.1
## 12 WBGene00018138 F37B4.7.1
## 13 WBGene00009897 F49E12.1.1
## 14 WBGene00021005 W03F11.1a.1
## 15 WBGene00011432 T04D3.2.1
## 16 WBGene00010204 F57F5.1.1
## 17 WBGene00018393 F43E2.5.2
## 18 WBGene00017490 F15E11.1.1
## 19 WBGene00017500 F15E11.14.1
## 20 WBGene00000785 W07B8.5.1
## description
## 1
## 2
## 3 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 4
## 5 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 6 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 7
## 8
## 9
## 10 D-aspartate oxidase 3 [Source:UniProtKB/Swiss-Prot;Acc:O01739]
## 11 Caveolin-1 [Source:UniProtKB/Swiss-Prot;Acc:Q94051]
## 12 Folate-like transporter 2 [Source:UniProtKB/Swiss-Prot;Acc:O45166]
## 13 Peroxidase skpo-1 [Source:UniProtKB/Swiss-Prot;Acc:Q20616]
## 14 Uterine Lumin Expressed/locailized [Source:UniProtKB/TrEMBL;Acc:X5M5P2]
## 15 SKN-1 Dependent Zygotic transcript [Source:UniProtKB/TrEMBL;Acc:O02297]
## 16
## 17 Methionine Sulfoxide Reductase A [Source:UniProtKB/TrEMBL;Acc:O02089]
## 18 Protein Up-regulated in Daf-2(Gf) [Source:UniProtKB/TrEMBL;Acc:G5EBF3]
## 19 Protein Up-regulated in Daf-2(Gf) [Source:UniProtKB/TrEMBL;Acc:G5EBF3]
## 20 Cathepsin B-like cysteine proteinase 5 [Source:UniProtKB/Swiss-Prot;Acc:P43509]
## ext_gene entrezgene_id pval qval b se_b
## 1 C44B7.5 174124 7.057950e-85 1.118332e-80 -2.188855 0.1121192
## 2 T04G9.7 180424 1.576815e-60 8.328212e-57 -2.091243 0.1274238
## 3 vet-2 172993 1.408001e-49 5.577445e-46 -1.915964 0.1294338
## 4 Y37D8A.19 176711 3.821684e-39 8.650655e-36 -1.418486 0.1083750
## 5 pes-2.1 173025 4.000704e-38 7.043462e-35 -2.491918 0.1930361
## 6 pes-2.2 173026 4.000704e-38 7.043462e-35 -2.491918 0.1930361
## 7 sepa-1 173196 8.901306e-38 1.410412e-34 -2.208029 0.1718666
## 8 B0513.4 178351 1.780045e-37 2.564074e-34 -1.673971 0.1308443
## 9 F48E3.4 181006 1.672903e-36 2.208929e-33 -1.422679 0.1127469
## 10 ddo-3 184746 6.915372e-36 8.428775e-33 -1.399326 0.1118917
## 11 cav-1 177815 1.116744e-33 1.263915e-30 -1.591271 0.1315598
## 12 folt-2 178745 3.081094e-32 3.254662e-29 -1.266567 0.1071556
## 13 skpo-1 174340 5.041138e-31 4.992302e-28 -1.228911 0.1060985
## 14 ule-1 171778 2.273631e-29 2.119158e-26 -1.231102 0.1094155
## 15 sdz-30 173198 2.532942e-29 2.229693e-26 -2.685465 0.2388758
## 16 F57F5.1 179645 2.536609e-28 2.009629e-25 -1.276974 0.1157005
## 17 msra-1 185709 4.221976e-27 2.675888e-24 -1.896329 0.1758917
## 18 pud-2.1 178713 2.139720e-25 1.210852e-22 -1.192239 0.1144839
## 19 pud-2.2 178710 2.139720e-25 1.210852e-22 -1.192239 0.1144839
## 20 cpr-5 178612 2.938093e-25 1.605313e-22 -1.261826 0.1215186
## mean_obs var_obs tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 8.681924 1.3871805 0.006406936 0.0149411299 0.01873447 0.01873447
## 2 7.699620 1.2637210 0.011712903 0.0048627478 0.02076073 0.02076073
## 3 8.088581 1.0623734 0.013814870 0.0019808591 0.01969134 0.01969134
## 4 9.265287 0.5950208 0.003512276 0.0199779905 0.01899745 0.01997799
## 5 6.538641 1.8380668 0.015789672 0.0587362241 0.03029372 0.05873622
## 6 6.538641 1.8380668 0.015789672 0.0587362241 0.03029372 0.05873622
## 7 6.528977 1.4333994 0.028600125 0.0185683623 0.03047611 0.03047611
## 8 7.571620 0.8217040 0.012797086 0.0117981168 0.02144336 0.02144336
## 9 8.878642 0.6000823 0.004552084 0.0208716297 0.01872224 0.02087163
## 10 8.419052 0.5739371 0.006008771 0.0108797729 0.01903073 0.01903073
## 11 7.539687 0.7471720 0.012961105 0.0146920667 0.02165486 0.02165486
## 12 10.118191 0.4730252 0.001910259 0.0152218968 0.02105439 0.02105439
## 13 9.551359 0.4343759 0.003086547 0.0002776849 0.01942725 0.01942725
## 14 8.688486 0.4531146 0.005211902 0.0182179264 0.01873159 0.01873159
## 15 5.798016 2.1486703 0.057929880 0.0449449658 0.05619342 0.05619342
## 16 10.940901 0.4787103 0.001103854 0.0138371405 0.02566937 0.02566937
## 17 8.569407 1.0743982 0.043060250 0.0117159100 0.01881556 0.01881556
## 18 7.970674 0.4285923 0.004944962 0.0212681820 0.01992090 0.02126818
## 19 7.970674 0.4285923 0.004944962 0.0212681820 0.01992090 0.02126818
## 20 9.223471 0.4802299 0.004121652 0.0254118707 0.01895503 0.02541187
write.table(unique(sleuth_significant_txn.down$ens_gene), file="results/sleuth_significant_q005_down.txt", quote=FALSE, row.names = FALSE, col.names = "#ens_gene")
#Summarise some of the numbers
#all transcripts and genes considered in the analysis
length(sleuth_table_txn$target_id)
## [1] 61428
length(sleuth_table_wt$target_id)
## [1] 46904
length(unique(sleuth_table_wt$target_id))
## [1] 46904
#transcripts and genes with sleuth data
sum(sleuth_table_wt$num_aggregated_transcripts, na.rm=TRUE )
## [1] 15845
length(sleuth_table_wt[!is.na(sleuth_table_wt$num_aggregated_transcripts),] $target_id)
## [1] 11337
#significantly DE genes (aggregated) at q<0.05
length(sleuth_significant_wt$target_id)
## [1] 2328
#significantly DE transcripts
length(sleuth_significant_txn$target_id)
## [1] 2263
#volcano plot for genes
#Note: there is a problem with volcano and MA plots not setting hte p-value aggregate to FALSE when reading
#results from the WT test and thus not realising there is a b column
#Error reported is:
#Error in FUN(X[[i]], ...) : object 'b' not found
#To fix this, manually set the pval_aggregate in so before supplying to plots as suggested here:
#https://github.com/pachterlab/sleuth/issues/233
# so$pval_aggregate <- FALSE
# plot_volcano(obj=so,
# test = "conditionneurexin",
# test_type = "wt",
# which_model="full",
# sig_level = 0.05,
# sig_color = "red")
#can also do MA plot
# plot_ma(obj=so,
# test = "conditionneurexin",
# test_type = "wt",
# which_model="full",
# sig_level = 0.05,
# sig_color = "red")
# ALTERNATIVELY
#use the EnhancedVolcano package and function to make nicer plots and to also highlight
#genes easily (the highlight function in plot_volcano from sleuth is not working for me)
#not used as using cutoffs within the volcano plot options
#select.lab <- head(sleuth_significant_txn, 20)$target_id
#use the same function that plot_volcano() uses to produce the data frame needed to make the plot
toptable <- sleuth_results(so,
test="conditionneurexin",
test_type="wt",
which_model="full",
pval_aggregate=FALSE,
rename_cols=FALSE,
show_all=FALSE)
toptable <- dplyr::mutate(toptable, significant = qval < 0.05)
dim(dplyr::filter(toptable, abs(b)>1))
## [1] 3236 16
#there are 2045 transcripts
EnhancedVolcano(
toptable=toptable,
lab=toptable$target_id,
x='b',
y='qval',
xlab="beta_value",
selectLab = NULL,
title="Sleuth results (without p-value aggregation)",
subtitle="",
legend=c("NS","beta","P","P & beta"),
legendLabels=c("NS","beta","P","P & beta"),
transcriptLabSize=5.0,
pLabellingCutoff=10e-6,
FCcutoff=1.0
)
## Warning in EnhancedVolcano(toptable = toptable, lab = toptable$target_id, :
## transcriptLabSize argument deprecated in v1.4 - please use labSize
#create a table with information about top 10 most significant genes
select.res <- dplyr::filter(toptable, abs(b)>1)[1:10,c(1,2,5,6,7)]
dplyr::filter(t2g, target_id %in% select.res$target_id)
## ens_gene target_id
## 1 WBGene00000657 F38A3.1.1
## 2 WBGene00003977 F56G4.2.1
## 3 WBGene00008862 F15D4.5.1
## 4 WBGene00010158 F56G4.3.1
## 5 WBGene00010808 M01E5.6.1
## 6 WBGene00012557 Y37D8A.19.1
## 7 WBGene00016453 C35E7.1a.1
## 8 WBGene00016627 C44B7.5.1
## 9 WBGene00020218 T04G9.7.1
## 10 WBGene00022644 ZK6.10a.1
## description
## 1 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 2 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 3
## 4 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 5
## 6
## 7 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 8
## 9
## 10 Downstream Of DAF-16 (Regulated by DAF-16) [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## ext_gene entrezgene_id
## 1 col-81 174686
## 2 pes-2.1 173025
## 3 F15D4.5 174969
## 4 pes-2.2 173026
## 5 sepa-1 173196
## 6 Y37D8A.19 176711
## 7 vet-2 172993
## 8 C44B7.5 174124
## 9 T04G9.7 180424
## 10 dod-19 178564
merge(x=select.res, y=dplyr::filter(t2g, target_id %in% select.res$target_id), by="target_id", sort=FALSE, no.dups=TRUE)[,c("target_id","ext_gene","entrezgene_id.x","qval","description")]
## target_id ext_gene entrezgene_id.x qval
## 1 C44B7.5.1 C44B7.5 174124 1.118332e-80
## 2 F15D4.5.1 F15D4.5 174969 1.278185e-77
## 3 T04G9.7.1 T04G9.7 180424 8.328212e-57
## 4 C35E7.1a.1 vet-2 172993 5.577445e-46
## 5 F38A3.1.1 col-81 174686 1.185054e-42
## 6 ZK6.10a.1 dod-19 178564 1.619225e-37
## 7 Y37D8A.19.1 Y37D8A.19 176711 8.650655e-36
## 8 F56G4.2.1 pes-2.1 173025 7.043462e-35
## 9 F56G4.3.1 pes-2.2 173026 7.043462e-35
## 10 M01E5.6.1 sepa-1 173196 1.410412e-34
## description
## 1
## 2
## 3
## 4 Very Early Transcript [Source:UniProtKB/TrEMBL;Acc:H1ZUW7]
## 5 COLlagen [Source:UniProtKB/TrEMBL;Acc:Q20135]
## 6 Downstream Of DAF-16 (Regulated by DAF-16) [Source:UniProtKB/TrEMBL;Acc:A0A0K3ASK2]
## 7
## 8 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 9 Patterned Expression Site; Pes-2 protein [Source:UniProtKB/TrEMBL;Acc:G5ECQ5]
## 10
#Examine some bootstrapped counts for individual transcripts of the same gene (note these are log2 transformed by sleuth)
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[1], units = "est_counts", color_by = "condition")
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[2], units = "est_counts", color_by = "condition")
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[3], units = "est_counts", color_by = "condition")
plot_bootstrap(so, target_id=(dplyr::filter(t2g, ens_gene=="WBGene00019606"))$target_id[4], units = "est_counts", color_by = "condition")
#use the shiny app when not knitting the document
#sleuth_live(so)
A note on sleuth counts
The "raw" counts of sleuth correspond more or less to assay(dds) in DESeq2. The "est_counts" from the normalised data frame (so[["obs_norm""]]) are on a similar scale to raw counts and correlate well. The tpm values take into account the length of the genes/transcripts and show reduced correlation to the est_counts. Neither est_counts or tpm values are stored as log normalised - a log must be taken, if one wants to have logged values. However, when the diff expression results are obtained the mean_obs is a value obtained as the log (or log2 if the transformation function has been changed to log2) of the mean of observations across all samples.
As the est_counts are not log-transformed, the PCA plots based on them are likely dominated simply by the genes with very high read counts.
#pick a gene with large counts
t2g[t2g$target_id == "ZK185.9",]
## ens_gene target_id description ext_gene entrezgene_id
## 60688 WBGene00220238 ZK185.9 ZK185.9 NA
#DEseq2 counts
assay(dds)["WBGene00220238",]
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4
## 9234 8849 10032 7426 6046 5497 5767 7229
#log2-transformed
log2(assay(dds)["WBGene00220238",])
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4
## 13.17274 13.11130 13.29232 12.85837 12.56177 12.42443 12.49361 12.81958
#rlog-transformed
assay(rld)["WBGene00220238",]
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4
## 13.20606 13.00318 13.35027 12.82825 12.54772 12.58265 12.59644 12.63061
#raw counts from sleuth
filter(so[["obs_raw"]], so[["obs_raw"]]$target_id=="ZK185.9")$est_counts
## [1] 9262 8902 10082 7458 6062 5510 5794 7255
#normalised counts from_sleuth
filter(so[["obs_norm"]], so[["obs_norm"]]$target_id=="ZK185.9")$est_counts
## [1] 10904.018 8848.183 12688.245 7280.942 5331.194 5580.740 5628.467
## [8] 5831.883
#log2 and normalised counts from sleuth
mean(log2(filter(so[["obs_norm"]], so[["obs_norm"]]$target_id=="ZK185.9")$est_counts))
## [1] 12.84745
#mean_obs from sleuth transcript table results
filter(sleuth_table_txn, target_id=="ZK185.9")$mean_obs
## [1] 12.84755
#sleuth_matrix results should agree
sleuth_matrix <- sleuth_to_matrix(so, 'obs_norm', 'est_counts')
sleuth_matrix.raw <- sleuth_to_matrix(so, 'obs_raw', 'est_counts')
sleuth_matrix.tpm <- sleuth_to_matrix(so, 'obs_norm', 'tpm')
sleuth_matrix[59889,]
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4
## 10904.018 8848.183 12688.245 7280.942 5331.194 5580.740 5628.467 5831.883
log2(mean(sleuth_matrix[59889,]))
## [1] 12.92216
log2(mean(sleuth_matrix.raw[59889,]))
## [1] 12.88047
sleuth_matrix.tpm[59889,]
## N1_S5 N2_S6 N3_S7 N4_S8 SG1_S1 SG2_S2 SG3_S3 SG4_S4
## 32939.89 27935.04 39259.63 23175.80 16106.07 17440.62 17340.28 18035.60
#Here, we check PCA plots based on kallisto counts
#NOTE: all PCA plots below are done with the transcript counts as sleuth_to_matrix won't return gene-summarised counts
#first, using sleuth's own plot_pca (this, it seems, uses transcript counts after all)
#In addition, note that est_counts in sleuth are not log2 normalised
#see: https://hbctraining.github.io/DGE_workshop_salmon/lessons/09_sleuth.html
plot_pca(so,
color_by = 'condition',
text_labels = TRUE)
#and then using the PCAtools package
#need first the counts table (which_units would be scaled_reads_per_base for genes but it doesn't work for aggregation of pvalues (probably only for the old gene_mode=true))
#sleuth_matrix <- sleuth_to_matrix(so, 'obs_norm', 'scaled_reads_per_base')
sleuth_matrix <- sleuth_to_matrix(so, 'obs_norm', 'est_counts')
dim(sleuth_matrix) #clearly this is counts of transcripts, not genes
## [1] 61428 8
#alternative (this is how plot_pca in sleuth gets the matrix) - THIS WON'T WORK because sleuth.R does
#not make the spread_abundance_by function available...
#mat <- spread_abundance_by(so$obs_norm_filt, 'est_counts',
# so$sample_to_covariates$sample)
#will remove using removeVar a % of variables with low variance: we want to keep about 500 transcripts and we have about 60,000 so we will remove 99% of all variables
p<- pca(sleuth_matrix, metadata=coldata, removeVar=0.99)
## -- removing the lower 99% of variables based on variance
dim(p$loadings) #should give about 600 transcripts
## [1] 614 8
#this is essentially the same PCA as by doing: prcomp(t(sleuth_matrix)), scale=FALSE, center=TRUE)
screeplot(p)
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
#the PCA plot is virtually identical to the one drawn by sleuth's plot_pca()
biplot(p, colby="condition", labSize=4.0, widthConnectors=0.2, xlim=c(-25000, 20000), title="Kallisto est_counts")
pairsplot(p, colby= "condition")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
plotloadings(p)
## -- variables retained:
## C24F3.6.1, C42D8.2a.1, F31E3.5.2, MTCE.26.1, B0035.10.1, K07H8.6a.1, H02I12.7.1, ZC155.8, H02I12.6.1, T27E9.1d.3
#get the top 20 for PC1 and PC2
rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20]
## [1] "F58E10.3a.1" "Y67D8C.5h.1" "C23G10.3.1" "Y105E8B.1a.1" "F25H2.5.1"
## [6] "W09C5.6a.1" "C07G2.3a.1" "T23D8.3.1" "T27E9.7.2" "F28C6.7a.1"
## [11] "F01G4.6a.3" "H06I04.4a.3" "T25G3.2.1" "Y17G7B.5a.1" "C36E6.5.1"
## [16] "C47D12.1b.1" "T15B7.3.1" "C53D5.6.1" "C14B9.7.1" "T23G11.3.2"
#or getting all the info about these transcripts
t2g[t2g$target_id %in% rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20],]
## ens_gene target_id
## 628 WBGene00000380 C07G2.3a.1
## 839 WBGene00000496 T25G3.2.1
## 1148 WBGene00000716 T15B7.3.1
## 2842 WBGene00001595 T23G11.3.2
## 3592 WBGene00002077 C53D5.6.1
## 4101 WBGene00002978 Y105E8B.1a.1
## 4459 WBGene00003154 Y17G7B.5a.1
## 4921 WBGene00003370 C36E6.5.1
## 6953 WBGene00004433 C14B9.7.1
## 6967 WBGene00004440 F28C6.7a.1
## 6980 WBGene00004445 W09C5.6a.1
## 7044 WBGene00004472 C23G10.3.1
## 9943 WBGene00006725 H06I04.4a.3
## 10848 WBGene00007028 C47D12.1b.1
## 13058 WBGene00008505 F01G4.6a.3
## 13949 WBGene00009119 F25H2.5.1
## 15632 WBGene00010260 F58E10.3a.1
## 18140 WBGene00011944 T23D8.3.1
## 18374 WBGene00012097 T27E9.7.2
## 32556 WBGene00022069 Y67D8C.5h.1
## description
## 628 T-complex protein 1 subunit epsilon [Source:UniProtKB/Swiss-Prot;Acc:P47209]
## 839 Chitin synthase chs-1 [Source:UniProtKB/Swiss-Prot;Acc:G5ECD6]
## 1148 COLlagen [Source:UniProtKB/TrEMBL;Acc:O17038]
## 2842 Female germline-specific tumor suppressor gld-1 [Source:UniProtKB/Swiss-Prot;Acc:Q17339]
## 3592 IMportin Beta family [Source:UniProtKB/TrEMBL;Acc:Q9N5V3]
## 4101 Tropomyosin isoforms a/b/d/f [Source:UniProtKB/Swiss-Prot;Acc:Q22866]
## 4459 DNA helicase [Source:UniProtKB/TrEMBL;Acc:Q7K7J6]
## 4921 Myosin regulatory light chain 2 [Source:UniProtKB/Swiss-Prot;Acc:P19626]
## 6953 60S ribosomal protein L21 [Source:UniProtKB/Swiss-Prot;Acc:P34334]
## 6967 60S ribosomal protein L26 [Source:UniProtKB/Swiss-Prot;Acc:Q19869]
## 6980 60S ribosomal protein L31 [Source:UniProtKB/Swiss-Prot;Acc:Q9U332]
## 7044 40S ribosomal protein S3 [Source:UniProtKB/Swiss-Prot;Acc:P48152]
## 9943 Ubiquitin-like protein 1-40S ribosomal protein S27a Ubiquitin-like protein 1 40S ribosomal protein S27a [Source:UniProtKB/Swiss-Prot;Acc:P37165]
## 10848 Transcription-associated protein 1 [Source:UniProtKB/Swiss-Prot;Acc:G5EEV2]
## 13058 Phosphate carrier protein, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:P40614]
## 13949 Nucleoside diphosphate kinase [Source:UniProtKB/TrEMBL;Acc:Q93576]
## 15632 DEAD boX helicase homolog [Source:UniProtKB/TrEMBL;Acc:Q9XUW5]
## 18140
## 18374 ABC transporter, class F [Source:UniProtKB/TrEMBL;Acc:G5EFG4]
## 32556 Enhancer of EfL-1 mutant phenotype [Source:UniProtKB/TrEMBL;Acc:U4PBK1]
## ext_gene entrezgene_id
## 628 cct-5 175588
## 839 chs-1 172508
## 1148 col-143 179073
## 2842 gld-1 172532
## 3592 imb-3 3565440
## 4101 lev-11 173319
## 4459 mcm-2 174841
## 4921 mlc-2 181775
## 6953 rpl-21 176146
## 6967 rpl-26 174379
## 6980 rpl-31 173235
## 7044 rps-3 175879
## 9943 ubl-1 175413
## 10848 trr-1 174788
## 13058 F01G4.6 178020
## 13949 ndk-1 172939
## 15632 ddx-17 179897
## 18140 T23D8.3 172857
## 18374 abcf-2 176771
## 32556 eel-1 177084
rownames(p$loadings)[order(abs(p$loadings[,2]))][1:20]
## [1] "Y111B2A.18.5" "T19H12.2.1" "H06H21.3.1" "K01G5.4.1" "ZC434.5.1"
## [6] "ZK813.3.1" "C17F4.7.1" "C53A5.1.1" "C44B12.1.1" "ZC395.10.2"
## [11] "T04H1.4a.1" "Y73B6BL.38.1" "Y37E3.8a.1" "F14H3.6.1" "Y67D8C.5h.1"
## [16] "R06C7.4.1" "C26F1.9.1" "Y67D8C.5d.1" "K12C11.2.1" "F10B5.1.2"
t2g[t2g$target_id %in% rownames(p$loadings)[order(abs(p$loadings[,2]))][1:20],]
## ens_gene target_id
## 2274 WBGene00001337 ZC434.5.1
## 6665 WBGene00004296 T04H1.4a.1
## 6675 WBGene00004302 K01G5.4.1
## 6922 WBGene00004421 F10B5.1.2
## 7009 WBGene00004453 C26F1.9.1
## 7147 WBGene00004700 Y111B2A.18.5
## 7470 WBGene00004888 K12C11.2.1
## 12664 WBGene00008262 C53A5.1.1
## 13567 WBGene00008825 F14H3.6.1
## 16839 WBGene00011063 R06C7.4.1
## 23548 WBGene00015913 C17F4.7.1
## 24671 WBGene00016636 C44B12.1.1
## 28243 WBGene00019162 H06H21.3.1
## 30240 WBGene00020588 T19H12.2.1
## 31369 WBGene00021350 Y37E3.8a.1
## 32552 WBGene00022069 Y67D8C.5d.1
## 32556 WBGene00022069 Y67D8C.5h.1
## 32824 WBGene00022257 Y73B6BL.38.1
## 33312 WBGene00022599 ZC395.10.2
## 33639 WBGene00022822 ZK813.3.1
## description
## 2274 Glutamyl(E) Amino-acyl tRNA Synthetase [Source:UniProtKB/TrEMBL;Acc:Q23315]
## 6665 DNA repair protein rad-50 [Source:UniProtKB/Swiss-Prot;Acc:O44199]
## 6675 GTP-binding nuclear protein ran-1 [Source:UniProtKB/Swiss-Prot;Acc:O17915]
## 6922 60S ribosomal protein L10 [Source:UniProtKB/Swiss-Prot;Acc:Q09533]
## 7009 60S ribosomal protein L39 [Source:UniProtKB/Swiss-Prot;Acc:P52814]
## 7147 Probable splicing factor, arginine/serine-rich 3 [Source:UniProtKB/Swiss-Prot;Acc:Q9NEW6]
## 7470 Small ubiquitin-related modifier [Source:UniProtKB/Swiss-Prot;Acc:P55853]
## 12664 RNAi-Induced Longevity [Source:UniProtKB/TrEMBL;Acc:O17694]
## 13567
## 16839 Chondroitin proteoglycan 3 [Source:UniProtKB/Swiss-Prot;Acc:Q21771]
## 23548
## 24671 PERMeable eggshell [Source:UniProtKB/TrEMBL;Acc:O44145]
## 28243 Eukaryotic Initiation Factor [Source:UniProtKB/TrEMBL;Acc:Q9TXU7]
## 30240 Acidic leucine-rich nuclear phosphoprotein 32-related protein 2 [Source:UniProtKB/Swiss-Prot;Acc:O01615]
## 31369
## 32552 Enhancer of EfL-1 mutant phenotype [Source:UniProtKB/TrEMBL;Acc:U4PBK1]
## 32556 Enhancer of EfL-1 mutant phenotype [Source:UniProtKB/TrEMBL;Acc:U4PBK1]
## 32824 PUF (Pumilio/FBF) domain-containing [Source:UniProtKB/TrEMBL;Acc:Q8IU00]
## 33312 Co-chaperone protein daf-41 [Source:UniProtKB/Swiss-Prot;Acc:Q23280]
## 33639
## ext_gene entrezgene_id
## 2274 ears-1 172904
## 6665 rad-50 179678
## 6675 ran-1 176503
## 6922 rpl-10 174311
## 7009 rpl-39 179153
## 7147 rsp-3 176688
## 7470 smo-1 266820
## 12664 ril-1 179958
## 13567 F14H3.6 180077
## 16839 cpg-3 172465
## 23548 C17F4.7 173668
## 24671 perm-2 176933
## 28243 eif-1.A 177254
## 30240 T19H12.2 178881
## 31369 Y37E3.8 171767
## 32552 eel-1 177084
## 32556 eel-1 177084
## 32824 puf-11 177406
## 33312 daf-41 175727
## 33639 ZK813.3 180634
#NOTE: the est_counts are not logged in the PCA so we expect the largest counts to dominate as they have more variance in RNA-seq data
quantile(sleuth_table_txn$mean_obs, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 0.7675163 3.5198016 5.7369968 7.8353703 14.8657162
quantile(sleuth_table_txn$var_obs, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 0.001561586 0.071094574 0.203265678 0.902161600 27.826633120
filter(sleuth_table_txn, target_id %in% rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20]) #counts here are log2 counts
## ens_gene target_id
## 1 WBGene00003370 C36E6.5.1
## 2 WBGene00012097 T27E9.7.2
## 3 WBGene00002978 Y105E8B.1a.1
## 4 WBGene00000716 T15B7.3.1
## 5 WBGene00000496 T25G3.2.1
## 6 WBGene00008505 F01G4.6a.3
## 7 WBGene00004440 F28C6.7a.1
## 8 WBGene00003154 Y17G7B.5a.1
## 9 WBGene00004433 C14B9.7.1
## 10 WBGene00004445 W09C5.6a.1
## 11 WBGene00000380 C07G2.3a.1
## 12 WBGene00006725 H06I04.4a.3
## 13 WBGene00004472 C23G10.3.1
## 14 WBGene00009119 F25H2.5.1
## 15 WBGene00001595 T23G11.3.2
## 16 WBGene00011944 T23D8.3.1
## 17 WBGene00010260 F58E10.3a.1
## 18 WBGene00002077 C53D5.6.1
## 19 WBGene00022069 Y67D8C.5h.1
## 20 WBGene00007028 C47D12.1b.1
## description
## 1 Myosin regulatory light chain 2 [Source:UniProtKB/Swiss-Prot;Acc:P19626]
## 2 ABC transporter, class F [Source:UniProtKB/TrEMBL;Acc:G5EFG4]
## 3 Tropomyosin isoforms a/b/d/f [Source:UniProtKB/Swiss-Prot;Acc:Q22866]
## 4 COLlagen [Source:UniProtKB/TrEMBL;Acc:O17038]
## 5 Chitin synthase chs-1 [Source:UniProtKB/Swiss-Prot;Acc:G5ECD6]
## 6 Phosphate carrier protein, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:P40614]
## 7 60S ribosomal protein L26 [Source:UniProtKB/Swiss-Prot;Acc:Q19869]
## 8 DNA helicase [Source:UniProtKB/TrEMBL;Acc:Q7K7J6]
## 9 60S ribosomal protein L21 [Source:UniProtKB/Swiss-Prot;Acc:P34334]
## 10 60S ribosomal protein L31 [Source:UniProtKB/Swiss-Prot;Acc:Q9U332]
## 11 T-complex protein 1 subunit epsilon [Source:UniProtKB/Swiss-Prot;Acc:P47209]
## 12 Ubiquitin-like protein 1-40S ribosomal protein S27a Ubiquitin-like protein 1 40S ribosomal protein S27a [Source:UniProtKB/Swiss-Prot;Acc:P37165]
## 13 40S ribosomal protein S3 [Source:UniProtKB/Swiss-Prot;Acc:P48152]
## 14 Nucleoside diphosphate kinase [Source:UniProtKB/TrEMBL;Acc:Q93576]
## 15 Female germline-specific tumor suppressor gld-1 [Source:UniProtKB/Swiss-Prot;Acc:Q17339]
## 16
## 17 DEAD boX helicase homolog [Source:UniProtKB/TrEMBL;Acc:Q9XUW5]
## 18 IMportin Beta family [Source:UniProtKB/TrEMBL;Acc:Q9N5V3]
## 19 Enhancer of EfL-1 mutant phenotype [Source:UniProtKB/TrEMBL;Acc:U4PBK1]
## 20 Transcription-associated protein 1 [Source:UniProtKB/Swiss-Prot;Acc:G5EEV2]
## ext_gene entrezgene_id pval qval b se_b mean_obs
## 1 mlc-2 181775 0.1573189 0.4156609 0.20255406 0.1432342 10.917980
## 2 abcf-2 176771 0.2613953 0.5546491 1.88264129 1.6762909 8.040718
## 3 lev-11 173319 0.2672519 0.5625815 0.17964413 0.1619271 10.507566
## 4 col-143 179073 0.3920183 0.6871922 0.17741363 0.2072679 11.359617
## 5 chs-1 172508 0.3929816 0.6880750 -0.12320007 0.1442247 10.537196
## 6 F01G4.6 178020 0.4183658 0.7107329 0.39480216 0.4878560 9.811750
## 7 rpl-26 174379 0.4201749 0.7121933 0.10188433 0.1263889 11.523997
## 8 mcm-2 174841 0.4287614 0.7185601 -0.58715189 0.7419972 9.186787
## 9 rpl-21 176146 0.4320922 0.7207602 0.11262738 0.1433618 11.939822
## 10 rpl-31 173235 0.4966842 0.7659363 0.07988817 0.1175317 11.058116
## 11 cct-5 175588 0.5824013 0.8167949 -0.17780989 0.3233610 9.867698
## 12 ubl-1 175413 0.5989128 0.8264193 0.07082058 0.1346489 11.486430
## 13 rps-3 175879 0.6171552 0.8384484 0.07756783 0.1551710 12.547761
## 14 ndk-1 172939 0.6274093 0.8432864 0.06690722 0.1378460 11.993970
## 15 gld-1 172532 0.6503874 0.8555740 0.11618394 0.2563499 10.276923
## 16 T23D8.3 172857 0.8093159 0.9345974 -0.33352247 1.3821404 7.392134
## 17 ddx-17 179897 0.8927598 0.9677811 -0.02900975 0.2151852 10.601373
## 18 imb-3 3565440 0.9493524 0.9862971 -0.01218098 0.1917661 10.965742
## 19 eel-1 177084 0.9607273 0.9890853 0.01289716 0.2619195 10.180434
## 20 trr-1 174788 0.9780775 0.9928535 -0.03543874 1.2896562 8.762508
## var_obs tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 0.04689265 0.0012062850 0.0398257590 0.02549664 0.03982576
## 2 5.82972701 1.9460234703 3.6738786294 0.01978004 3.67387863
## 3 0.05416982 0.0052782755 0.0471625105 0.02285893 0.04716251
## 4 0.08263871 0.0007424335 0.0851775337 0.02937493 0.08517753
## 5 0.03999512 0.0177451347 0.0238564219 0.02302239 0.02385642
## 6 0.45253984 0.0293356041 0.4466712882 0.02003988 0.44667129
## 7 0.02816261 0.0007992522 0.0285969852 0.03114905 0.03114905
## 8 1.04231599 0.4422604901 0.6588590533 0.01892136 0.65885905
## 9 0.03885730 0.0004983207 0.0406068850 0.03661105 0.04060688
## 10 0.02186696 0.0010285086 0.0223555711 0.02659887 0.02659887
## 11 0.18828292 0.0432134808 0.1659111355 0.02020059 0.16591114
## 12 0.03251359 0.0008267791 0.0354338952 0.03072603 0.03543390
## 13 0.02374789 0.0003575085 0.0253427731 0.04779856 0.04779856
## 14 0.03063936 0.0005643507 0.0336893831 0.03743869 0.03743869
## 15 0.11651154 0.0047859463 0.1266446135 0.02171587 0.12664461
## 16 1.89883705 3.7979747665 -1.6197439567 0.02264961 0.02264961
## 17 0.07961986 0.0305765634 0.0620327569 0.02339009 0.06203276
## 18 0.06308395 0.0010506232 0.0724978597 0.02585985 0.07249786
## 19 0.09989635 0.1159014447 0.0005888509 0.02130218 0.02130218
## 20 0.54414177 3.3077146586 -2.6733012304 0.01871170 0.01871170
filter(so[["obs_norm"]], target_id %in% rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20])
## target_id sample est_counts tpm eff_len len
## 1 C07G2.3a.1 N1_S5 565.792627 65.9835239 1762.1665 1928
## 2 C07G2.3a.1 N2_S6 1633.898377 189.8387113 1749.8902 1928
## 3 C07G2.3a.1 N3_S7 938.042353 109.0026126 1756.9261 1928
## 4 C07G2.3a.1 N4_S8 1121.754656 130.6415994 1750.0664 1928
## 5 C07G2.3a.1 SG1_S1 935.462090 110.3560996 1755.0990 1928
## 6 C07G2.3a.1 SG2_S2 896.176158 105.0782241 1751.9367 1928
## 7 C07G2.3a.1 SG3_S3 776.560620 91.5892399 1758.6784 1928
## 8 C07G2.3a.1 SG4_S4 912.542312 107.3256718 1756.0500 1928
## 9 C14B9.7.1 N1_S5 4046.329996 2258.6147685 368.1665 534
## 10 C14B9.7.1 N2_S6 3345.650763 1910.7989267 355.9886 534
## 11 C14B9.7.1 N3_S7 4252.487568 2391.6720608 363.0023 534
## 12 C14B9.7.1 N4_S8 3537.963810 2024.5625675 356.1724 534
## 13 C14B9.7.1 SG1_S1 3987.402183 2285.8846728 361.1665 534
## 14 C14B9.7.1 SG2_S2 3294.763435 1890.3256779 358.0349 534
## 15 C14B9.7.1 SG3_S3 4849.379812 2758.2378516 364.6784 534
## 16 C14B9.7.1 SG4_S4 4368.887950 2490.9473398 362.2379 534
## 17 C23G10.3.1 N1_S5 6729.363473 2156.8915674 641.1665 807
## 18 C23G10.3.1 N2_S6 5071.155731 1639.4669422 628.8902 807
## 19 C23G10.3.1 N3_S7 5887.285245 1890.0634762 635.9261 807
## 20 C23G10.3.1 N4_S8 5743.333635 1860.8257213 629.0664 807
## 21 C23G10.3.1 SG1_S1 6216.794450 2029.9307633 634.0990 807
## 22 C23G10.3.1 SG2_S2 5316.388955 1730.8870226 630.9367 807
## 23 C23G10.3.1 SG3_S3 6852.468989 2228.9529891 637.6784 807
## 24 C23G10.3.1 SG4_S4 6317.403937 2054.4367795 635.0872 807
## 25 C36E6.5.1 N1_S5 2194.932340 924.0134801 488.1665 654
## 26 C36E6.5.1 N2_S6 1553.680577 663.7820792 475.8902 654
## 27 C36E6.5.1 N3_S7 2065.416054 873.1054381 482.9574 654
## 28 C36E6.5.1 N4_S8 1500.820981 642.5390254 476.0664 654
## 29 C36E6.5.1 SG1_S1 1971.328879 848.3917238 481.0990 654
## 30 C36E6.5.1 SG2_S2 2100.956693 902.9932260 477.9367 654
## 31 C36E6.5.1 SG3_S3 2181.220415 933.4722941 484.6784 654
## 32 C36E6.5.1 SG4_S4 2052.166454 879.0736131 482.1418 654
## 33 C47D12.1b.1 N1_S5 183.417332 3.0624724 12308.1665 12474
## 34 C47D12.1b.1 N2_S6 622.174884 10.2878122 12295.8902 12474
## 35 C47D12.1b.1 N3_S7 564.028707 9.3596745 12302.9261 12474
## 36 C47D12.1b.1 N4_S8 577.432770 9.5713535 12296.0664 12474
## 37 C47D12.1b.1 SG1_S1 280.671070 4.7241698 12301.0990 12474
## 38 C47D12.1b.1 SG2_S2 283.939459 4.7427610 12297.9367 12474
## 39 C47D12.1b.1 SG3_S3 827.728301 13.9531753 12304.6784 12474
## 40 C47D12.1b.1 SG4_S4 510.786499 8.5753015 12302.0500 12474
## 41 C53D5.6.1 N1_S5 1963.587047 113.8573858 3544.1665 3710
## 42 C53D5.6.1 N2_S6 1707.527528 98.2949152 3531.8902 3710
## 43 C53D5.6.1 N3_S7 2134.188352 123.1201693 3538.9261 3710
## 44 C53D5.6.1 N4_S8 2271.521469 131.0768870 3532.0664 3710
## 45 C53D5.6.1 SG1_S1 1962.470963 114.8755989 3537.0990 3710
## 46 C53D5.6.1 SG2_S2 1438.019837 83.5879848 3533.9367 3710
## 47 C53D5.6.1 SG3_S3 2275.700787 133.3165894 3540.6784 3710
## 48 C53D5.6.1 SG4_S4 2446.843014 142.8333495 3538.0500 3710
## 49 F01G4.6a.3 N1_S5 1199.679839 250.0001510 986.1665 1152
## 50 F01G4.6a.3 N2_S6 412.613208 86.1397623 973.8902 1152
## 51 F01G4.6a.3 N3_S7 1179.461018 245.4795164 980.9261 1152
## 52 F01G4.6a.3 N4_S8 644.628582 134.8835954 974.0664 1152
## 53 F01G4.6a.3 SG1_S1 916.048019 193.7150706 979.0990 1152
## 54 F01G4.6a.3 SG2_S2 582.534896 122.6134330 975.9367 1152
## 55 F01G4.6a.3 SG3_S3 1416.000581 298.8873604 982.6784 1152
## 56 F01G4.6a.3 SG4_S4 1489.382410 313.8668679 980.0500 1152
## 57 F25H2.5.1 N1_S5 4485.485136 1869.1364071 493.1665 659
## 58 F25H2.5.1 N2_S6 3496.731843 1478.3829649 480.8902 659
## 59 F25H2.5.1 N3_S7 4014.630145 1679.6995360 487.9574 659
## 60 F25H2.5.1 N4_S8 4004.615769 1696.6567018 481.0664 659
## 61 F25H2.5.1 SG1_S1 4295.207822 1829.4950856 486.0990 659
## 62 F25H2.5.1 SG2_S2 3355.533741 1427.2799738 482.9367 659
## 63 F25H2.5.1 SG3_S3 4700.750983 1991.1860541 489.6784 659
## 64 F25H2.5.1 SG4_S4 4480.622159 1899.6359120 487.1418 659
## 65 F28C6.7a.1 N1_S5 3384.256498 2293.0075869 303.3074 469
## 66 F28C6.7a.1 N2_S6 2494.390535 1740.4850788 291.3833 469
## 67 F28C6.7a.1 N3_S7 2810.829141 1924.7783971 298.1410 469
## 68 F28C6.7a.1 N4_S8 2750.204724 1922.0556362 291.6334 469
## 69 F28C6.7a.1 SG1_S1 2951.481603 2060.9629545 296.5116 469
## 70 F28C6.7a.1 SG2_S2 2693.786423 1885.8111975 293.4288 469
## 71 F28C6.7a.1 SG3_S3 3503.309227 2423.7831937 299.8059 469
## 72 F28C6.7a.1 SG4_S4 3107.762471 2158.5699941 297.3508 469
## 73 F58E10.3a.1 N1_S5 1255.679622 123.5773521 2088.1665 2254
## 74 F58E10.3a.1 N2_S6 1554.291074 152.2293904 2075.8902 2254
## 75 F58E10.3a.1 N3_S7 1802.236523 176.6468423 2082.9261 2254
## 76 F58E10.3a.1 N4_S8 1721.903878 169.0463501 2076.0664 2254
## 77 F58E10.3a.1 SG1_S1 1631.857072 162.3533092 2081.0990 2254
## 78 F58E10.3a.1 SG2_S2 1105.564582 109.2922964 2077.9367 2254
## 79 F58E10.3a.1 SG3_S3 1525.895787 151.8243139 2084.6784 2254
## 80 F58E10.3a.1 SG4_S4 2029.972006 201.3661079 2082.0500 2254
## 81 H06I04.4a.3 N1_S5 3099.411765 1579.8649924 403.1665 569
## 82 H06I04.4a.3 N2_S6 2423.899364 1260.6864924 390.9111 569
## 83 H06I04.4a.3 N3_S7 2963.928704 1520.4520885 397.9816 569
## 84 H06I04.4a.3 N4_S8 2756.949832 1436.6184339 391.1340 569
## 85 H06I04.4a.3 SG1_S1 2876.315745 1503.3278696 396.1454 569
## 86 H06I04.4a.3 SG2_S2 2390.863217 1249.7412749 392.9817 569
## 87 H06I04.4a.3 SG3_S3 3353.600469 1740.4285325 399.6784 569
## 88 H06I04.4a.3 SG4_S4 3239.499468 1684.5215224 397.1813 569
## 89 T15B7.3.1 N1_S5 2920.646145 705.9925346 850.1665 1016
## 90 T15B7.3.1 N2_S6 1914.256717 464.4977440 837.8902 1016
## 91 T15B7.3.1 N3_S7 2832.382988 684.3861468 844.9261 1016
## 92 T15B7.3.1 N4_S8 2352.555344 572.1363752 838.0664 1016
## 93 T15B7.3.1 SG1_S1 2701.732916 663.4918901 843.0990 1016
## 94 T15B7.3.1 SG2_S2 2102.604208 514.2197232 839.9367 1016
## 95 T15B7.3.1 SG3_S3 3355.824898 822.1221360 846.6784 1016
## 96 T15B7.3.1 SG4_S4 3196.240573 782.0936740 844.0500 1016
## 97 T23D8.3.1 N1_S5 166.347901 26.4355311 1293.1665 1459
## 98 T23D8.3.1 N2_S6 724.018260 114.9233172 1280.8902 1459
## 99 T23D8.3.1 N3_S7 33.814217 5.3601427 1287.9261 1459
## 100 T23D8.3.1 N4_S8 304.270388 48.4090081 1281.0664 1459
## 101 T23D8.3.1 SG1_S1 77.325877 12.4486491 1286.0990 1459
## 102 T23D8.3.1 SG2_S2 241.691835 38.6985331 1282.9367 1459
## 103 T23D8.3.1 SG3_S3 91.952906 14.7890254 1289.6784 1459
## 104 T23D8.3.1 SG4_S4 287.247659 46.0944422 1287.0500 1459
## 105 T23G11.3.2 N1_S5 1474.144287 209.7721782 1444.1665 1610
## 106 T23G11.3.2 N2_S6 882.719954 125.3382893 1431.8902 1610
## 107 T23G11.3.2 N3_S7 1335.059190 189.4218115 1438.9261 1610
## 108 T23G11.3.2 N4_S8 1159.022043 164.9554528 1432.0664 1610
## 109 T23G11.3.2 SG1_S1 1266.560639 182.4782349 1437.0990 1610
## 110 T23G11.3.2 SG2_S2 881.281668 126.2474186 1433.9367 1610
## 111 T23G11.3.2 SG3_S3 1512.961711 217.8293599 1440.6784 1610
## 112 T23G11.3.2 SG4_S4 1645.643905 236.3466019 1438.0500 1610
## 113 T25G3.2.1 N1_S5 1517.345302 74.0645382 4210.1665 4376
## 114 T25G3.2.1 N2_S6 1561.714293 75.6381750 4197.8902 4376
## 115 T25G3.2.1 N3_S7 1282.566033 62.2714988 4204.9261 4376
## 116 T25G3.2.1 N4_S8 1900.599036 92.2739969 4198.0664 4376
## 117 T25G3.2.1 SG1_S1 1383.627255 68.1586928 4203.0990 4376
## 118 T25G3.2.1 SG2_S2 1412.406876 69.0803966 4199.9367 4376
## 119 T25G3.2.1 SG3_S3 1256.939224 61.9769904 4206.6784 4376
## 120 T25G3.2.1 SG4_S4 1670.966360 82.0894440 4204.0500 4376
## 121 T27E9.7.2 N1_S5 819.103102 79.4325773 2119.1665 2285
## 122 T27E9.7.2 N2_S6 4.461545 0.4305404 2106.8902 2285
## 123 T27E9.7.2 N3_S7 436.793962 42.1846820 2113.9261 2285
## 124 T27E9.7.2 N4_S8 198.353126 19.1866415 2107.0664 2285
## 125 T27E9.7.2 SG1_S1 527.446351 51.7053865 2112.0990 2285
## 126 T27E9.7.2 SG2_S2 279.093819 27.1846923 2108.9367 2285
## 127 T27E9.7.2 SG3_S3 667.022275 65.3952501 2115.6784 2285
## 128 T27E9.7.2 SG4_S4 663.043344 64.8066582 2113.0500 2285
## 129 W09C5.6a.1 N1_S5 2247.154431 1676.0835773 275.5259 441
## 130 W09C5.6a.1 N2_S6 1860.002662 1433.5727511 263.7937 441
## 131 W09C5.6a.1 N3_S7 2179.098930 1645.2782702 270.3993 441
## 132 W09C5.6a.1 N4_S8 2029.351157 1566.6135290 264.0180 441
## 133 W09C5.6a.1 SG1_S1 2162.922539 1665.1826586 268.9372 441
## 134 W09C5.6a.1 SG2_S2 1850.666306 1430.0596307 265.8347 441
## 135 W09C5.6a.1 SG3_S3 2451.558197 1868.9216647 272.0863 441
## 136 W09C5.6a.1 SG4_S4 2350.637353 1800.7303943 269.6028 441
## 137 Y105E8B.1a.1 N1_S5 1698.336848 264.5753492 1319.1665 1485
## 138 Y105E8B.1a.1 N2_S6 1146.633514 178.3840690 1306.8902 1485
## 139 Y105E8B.1a.1 N3_S7 1480.567172 230.0515332 1313.9261 1485
## 140 Y105E8B.1a.1 N4_S8 1212.536184 189.0754720 1307.0664 1485
## 141 Y105E8B.1a.1 SG1_S1 1443.686139 227.8127123 1312.0990 1485
## 142 Y105E8B.1a.1 SG2_S2 1330.944100 208.8713736 1308.9367 1485
## 143 Y105E8B.1a.1 SG3_S3 1727.984465 272.4241553 1315.6784 1485
## 144 Y105E8B.1a.1 SG4_S4 1732.967261 272.5816261 1313.0500 1485
## 145 Y17G7B.5a.1 N1_S5 529.992749 40.9436232 2660.1665 2826
## 146 Y17G7B.5a.1 N2_S6 1087.027414 83.4662799 2647.8902 2826
## 147 Y17G7B.5a.1 N3_S7 308.421887 23.7170520 2654.9261 2826
## 148 Y17G7B.5a.1 N4_S8 1460.082202 112.3793638 2648.0664 2826
## 149 Y17G7B.5a.1 SG1_S1 812.464553 63.4048869 2653.0990 2826
## 150 Y17G7B.5a.1 SG2_S2 742.520980 57.5586998 2649.9367 2826
## 151 Y17G7B.5a.1 SG3_S3 158.817091 12.3997740 2656.6784 2826
## 152 Y17G7B.5a.1 SG4_S4 530.577577 41.2883258 2654.0500 2826
## 153 Y67D8C.5h.1 N1_S5 1385.349549 21.9841250 12950.1665 13116
## 154 Y67D8C.5h.1 N2_S6 1417.570864 22.2767509 12937.8902 13116
## 155 Y67D8C.5h.1 N3_S7 956.696930 15.0883850 12944.9261 13116
## 156 Y67D8C.5h.1 N4_S8 946.356074 14.9081354 12938.0664 13116
## 157 Y67D8C.5h.1 SG1_S1 1407.698592 22.5186922 12943.0990 13116
## 158 Y67D8C.5h.1 SG2_S2 1045.700973 16.6001916 12939.9367 13116
## 159 Y67D8C.5h.1 SG3_S3 862.738928 13.8221801 12946.6784 13116
## 160 Y67D8C.5h.1 SG4_S4 1451.006474 23.1519005 12944.0500 13116
#Repeat wiith kallisto TPMs
#note that there is only partial correlation between estimated counts and TPMs (TPMs take into account the gene length; est_counts seem to be normalised by sleuth when obs_norm is used)
sleuth_matrix.tpm <- sleuth_to_matrix(so, 'obs_norm', 'tpm')
dim(sleuth_matrix.tpm) #clearly this is counts of transcripts, not genes
## [1] 61428 8
cor(sleuth_matrix.tpm[,1], sleuth_matrix[,1])
## [1] 0.5984869
#will remove using removeVar a % of variables with low variance: we want to keep about 500 transcripts and we have about 60,000 so we will remove 99% of all variables
p.tpm<- pca(sleuth_matrix.tpm, metadata=coldata, removeVar=0.99)
## -- removing the lower 99% of variables based on variance
dim(p.tpm$loadings) #should give about 600 transcripts
## [1] 614 8
#this is essentially the same PCA as by doing: prcomp(t(sleuth_matrix)), scale=FALSE, center=TRUE)
screeplot(p.tpm)
#FIGURE: PCA_tpm_counts
#the PCA plot is virtually identical to the one drawn by sleuth's plot_pca()
biplot(p.tpm, colby="condition", labSize=4.0, widthConnectors=0.2, title="Kallisto TPM; top 1% varied transcripts")
pairsplot(p.tpm, colby= "condition")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
#this command seems to hang for tpm values so I'm not running it
#plotloadings(p)
#instead we can plot the top 20 for PC1 and PC2
rownames(p.tpm$loadings)[order(abs(p$loadings[,1]))][1:20]
## [1] "T04C12.6.4" "F28H1.2.1" "Y48G8AL.8.1" "C06A1.12" "H06I04.4a.3"
## [6] "F53A2.6a.2" "F26E4.6.1" "R144.17" "Y48G10A.9" "ZK632.9.2"
## [11] "F08G2.1.1" "F43E2.5.2" "Y54G2A.75.1" "W03G11.1a.1" "K10B3.7.2"
## [16] "F23H11.5.1" "D1054.10.1" "K02D7.3a.1" "F25H2.5.1" "T23D8.6.1"
t2g[t2g$target_id %in% rownames(p.tpm$loadings)[order(abs(p.tpm$loadings[,1]))][1:20],]
## ens_gene target_id
## 795 WBGene00000465 C07G2.1a.1
## 814 WBGene00000479 C07H6.5.1
## 1936 WBGene00001168 F31E3.5.1
## 2156 WBGene00001243 F41H10.7.1
## 2158 WBGene00001244 F41H10.8.1
## 6889 WBGene00004410 Y62E10A.1.2
## 6942 WBGene00004427 K11H12.2.1
## 11137 WBGene00007235 C01G10.8.1
## 18509 WBGene00012179 W01D2.1.1
## 21924 WBGene00014633 ZK897.4
## 24703 WBGene00016655 C44E4.6.3
## 29271 WBGene00019880 R05D3.6.1
## 33294 WBGene00022582 ZC262.5.1
## 33865 WBGene00023020 F30H5.4
## 33916 WBGene00023071 F56C9.12
## 35569 WBGene00045139 F01D5.12
## 35571 WBGene00045141 Y53F4B.47
## 35585 WBGene00045153 F11A10.9
## 60358 WBGene00219908 F20C5.10
## 60615 WBGene00220165 Y47G6A.36
## description
## 795 Chondroitin proteoglycan 1 [Source:UniProtKB/Swiss-Prot;Acc:Q17802]
## 814 ATP-dependent RNA helicase cgh-1 [Source:UniProtKB/Swiss-Prot;Acc:Q95YF3]
## 1936 Elongation factor 1-alpha [Source:UniProtKB/Swiss-Prot;Acc:P53013]
## 2156 Elongation of very long chain fatty acids protein 5 [Source:UniProtKB/Swiss-Prot;Acc:Q20300]
## 2158 Elongation of very long chain fatty acids protein 6 [Source:UniProtKB/Swiss-Prot;Acc:Q20303]
## 6889 Ribosomal protein, Large subunit, Acidic (P1) [Source:UniProtKB/TrEMBL;Acc:Q9U1X9]
## 6942 60S ribosomal protein L15 [Source:UniProtKB/Swiss-Prot;Acc:P91374]
## 11137 Activator of HSp90 (HSP90) Atpase (ATPase) activity homolog [Source:UniProtKB/TrEMBL;Acc:Q93168]
## 18509 Ribosomal protein L37 [Source:UniProtKB/TrEMBL;Acc:O62388]
## 21924
## 24703 Acyl-CoA-binding protein homolog 1 [Source:UniProtKB/Swiss-Prot;Acc:O01805]
## 29271 Putative ATP synthase subunit epsilon, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:P34539]
## 33294 Putative ATP synthase subunit epsilon, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:P34539]
## 33865
## 33916
## 35569
## 35571
## 35585
## 60358
## 60615
## ext_gene entrezgene_id
## 795 cpg-1 175586
## 814 cgh-1 176061
## 1936 eef-1A.1 175975
## 2156 elo-5 177320
## 2158 elo-6 177321
## 6889 rla-2 178297
## 6942 rpl-15 176891
## 11137 ahsa-1 180017
## 18509 W01D2.1 175141
## 21924 ZK897.4 NA
## 24703 acbp-1 172071
## 29271 R05D3.6 176178
## 33294 ZC262.5 191132
## 33865 F30H5.4 NA
## 33916 F56C9.12 NA
## 35569 F01D5.12 NA
## 35571 Y53F4B.47 NA
## 35585 F11A10.9 NA
## 60358 F20C5.10 NA
## 60615 Y47G6A.36 NA
rownames(p.tpm$loadings)[order(abs(p.tpm$loadings[,2]))][1:20]
## [1] "ZK813.3.1" "M18.7a.1" "F40F8.10.1" "T05G5.10b.2" "B0280.5.1"
## [6] "T07C4.5.1" "C37A2.7.1" "T10C6.5.1" "Y82E9BR.3.2" "K02F2.11"
## [11] "H13N06.6b.1" "T08B2.10.2" "ZK287.5.2" "C33H5.12a.1" "C53H9.1.1"
## [16] "ZK945.2.2" "R01H2.6.1" "Y49E10.15.1" "C27H5.9" "Y111B2A.18.5"
t2g[t2g$target_id %in% rownames(p.tpm$loadings)[order(abs(p.tpm$loadings[,2]))][1:20],]
## ens_gene target_id
## 259 WBGene00000122 M18.7a.1
## 3564 WBGene00002064 T05G5.10b.2
## 5885 WBGene00003928 ZK945.2.2
## 6706 WBGene00004320 ZK287.5.2
## 6970 WBGene00004441 C53H9.1.1
## 7059 WBGene00004478 F40F8.10.1
## 7073 WBGene00004486 T08B2.10.2
## 7147 WBGene00004700 Y111B2A.18.5
## 7153 WBGene00004703 C33H5.12a.1
## 7515 WBGene00004919 Y49E10.15.1
## 9582 WBGene00006541 H13N06.6b.1
## 9925 WBGene00006713 R01H2.6.1
## 17562 WBGene00011561 T07C4.5.1
## 17729 WBGene00011687 T10C6.5.1
## 22411 WBGene00015102 B0280.5.1
## 24450 WBGene00016493 C37A2.7.1
## 32952 WBGene00022336 Y82E9BR.3.2
## 33639 WBGene00022822 ZK813.3.1
## 35378 WBGene00044938 C27H5.9
## 60449 WBGene00220000 K02F2.11
## description
## 259 Ref/ALY RNA export adaptor family [Source:UniProtKB/TrEMBL;Acc:Q21559]
## 3564 Eukaryotic translation initiation factor 5A-1 [Source:UniProtKB/Swiss-Prot;Acc:P34563]
## 5885 Proteasome subunit alpha type-3 [Source:UniProtKB/Swiss-Prot;Acc:Q09583]
## 6706 RING-box protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q23457]
## 6970 60S ribosomal protein L27 [Source:UniProtKB/Swiss-Prot;Acc:P91914]
## 7059 40S ribosomal protein S9 [Source:UniProtKB/Swiss-Prot;Acc:Q20228]
## 7073 40S ribosomal protein S17 [Source:UniProtKB/Swiss-Prot;Acc:O01692]
## 7147 Probable splicing factor, arginine/serine-rich 3 [Source:UniProtKB/Swiss-Prot;Acc:Q9NEW6]
## 7153 Probable splicing factor, arginine/serine-rich 6 [Source:UniProtKB/Swiss-Prot;Acc:Q18409]
## 7515 Probable small nuclear ribonucleoprotein E [Source:UniProtKB/Swiss-Prot;Acc:Q9XTU6]
## 9582 Tyramine beta-hydroxylase [Source:UniProtKB/Swiss-Prot;Acc:Q9XTQ6]
## 9925 UBiquitin Conjugating enzyme [Source:UniProtKB/TrEMBL;Acc:Q21633]
## 17562 Transthyretin-like protein 15 [Source:UniProtKB/Swiss-Prot;Acc:Q22288]
## 17729 Spliceosome-associated protein CWC15 homolog [Source:UniProtKB/Swiss-Prot;Acc:O45766]
## 22411 Chondroitin proteoglycan-2 [Source:UniProtKB/Swiss-Prot;Acc:P41996]
## 24450 60S acidic ribosomal protein P2 [Source:UniProtKB/Swiss-Prot;Acc:O01504]
## 32952 ATP synthase lipid-binding protein, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:Q9BKS0]
## 33639
## 35378
## 60449
## ext_gene entrezgene_id
## 259 aly-3 178157
## 3564 iff-1 176373
## 5885 pas-7 174571
## 6706 rbx-1 179358
## 6970 rpl-27 171750
## 7059 rps-9 174699
## 7073 rps-17 172313
## 7147 rsp-3 176688
## 7153 rsp-6 177566
## 7515 snr-6 176668
## 9582 tbh-1 181639
## 9925 ubc-18 175985
## 17562 ttr-15 176448
## 17729 cwc-15 180069
## 22411 cpg-2 175991
## 24450 C37A2.7 172401
## 32952 Y82E9BR.3 175299
## 33639 ZK813.3 180634
## 35378 C27H5.9 NA
## 60449 K02F2.11 NA
quantile(sleuth_table_txn$mean_obs, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 0.7675163 3.5198016 5.7369968 7.8353703 14.8657162
quantile(sleuth_table_txn$var_obs, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 0.001561586 0.071094574 0.203265678 0.902161600 27.826633120
filter(sleuth_table_txn, target_id %in% rownames(p.tpm$loadings)[order(abs(p.tpm$loadings[,1]))][1:20])
## ens_gene target_id
## 1 WBGene00219908 F20C5.10
## 2 WBGene00019880 R05D3.6.1
## 3 WBGene00022582 ZC262.5.1
## 4 WBGene00016655 C44E4.6.3
## 5 WBGene00007235 C01G10.8.1
## 6 WBGene00001168 F31E3.5.1
## 7 WBGene00001243 F41H10.7.1
## 8 WBGene00000479 C07H6.5.1
## 9 WBGene00001244 F41H10.8.1
## 10 WBGene00045153 F11A10.9
## 11 WBGene00004427 K11H12.2.1
## 12 WBGene00012179 W01D2.1.1
## 13 WBGene00000465 C07G2.1a.1
## 14 WBGene00004410 Y62E10A.1.2
## 15 WBGene00023020 F30H5.4
## 16 WBGene00023071 F56C9.12
## 17 WBGene00220165 Y47G6A.36
## 18 WBGene00014633 ZK897.4
## 19 WBGene00045139 F01D5.12
## 20 WBGene00045141 Y53F4B.47
## description
## 1
## 2 Putative ATP synthase subunit epsilon, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:P34539]
## 3 Putative ATP synthase subunit epsilon, mitochondrial [Source:UniProtKB/Swiss-Prot;Acc:P34539]
## 4 Acyl-CoA-binding protein homolog 1 [Source:UniProtKB/Swiss-Prot;Acc:O01805]
## 5 Activator of HSp90 (HSP90) Atpase (ATPase) activity homolog [Source:UniProtKB/TrEMBL;Acc:Q93168]
## 6 Elongation factor 1-alpha [Source:UniProtKB/Swiss-Prot;Acc:P53013]
## 7 Elongation of very long chain fatty acids protein 5 [Source:UniProtKB/Swiss-Prot;Acc:Q20300]
## 8 ATP-dependent RNA helicase cgh-1 [Source:UniProtKB/Swiss-Prot;Acc:Q95YF3]
## 9 Elongation of very long chain fatty acids protein 6 [Source:UniProtKB/Swiss-Prot;Acc:Q20303]
## 10
## 11 60S ribosomal protein L15 [Source:UniProtKB/Swiss-Prot;Acc:P91374]
## 12 Ribosomal protein L37 [Source:UniProtKB/TrEMBL;Acc:O62388]
## 13 Chondroitin proteoglycan 1 [Source:UniProtKB/Swiss-Prot;Acc:Q17802]
## 14 Ribosomal protein, Large subunit, Acidic (P1) [Source:UniProtKB/TrEMBL;Acc:Q9U1X9]
## 15
## 16
## 17
## 18
## 19
## 20
## ext_gene entrezgene_id pval qval b se_b mean_obs
## 1 F20C5.10 <NA> 0.05068472 0.2000746 -1.58155828 0.8093374 3.098760
## 2 R05D3.6 176178 0.12148685 0.3552581 0.34342551 0.2217705 5.483882
## 3 ZC262.5 191132 0.12148685 0.3552581 0.34342551 0.2217705 5.483882
## 4 acbp-1 172071 0.12811055 0.3668523 0.25467582 0.1673742 9.300039
## 5 ahsa-1 180017 0.14879226 0.4021858 -0.44276811 0.3066663 9.764503
## 6 eef-1A.1 175975 0.24474243 0.5357922 -0.41427756 0.3561489 10.715867
## 7 elo-5 177320 0.24937890 0.5412788 -0.24436169 0.2121454 9.587738
## 8 cgh-1 176061 0.25097970 0.5427560 -0.19436809 0.1693142 12.214112
## 9 elo-6 177321 0.35878858 0.6561640 -0.15603208 0.1700296 10.155169
## 10 F11A10.9 <NA> 0.56570210 0.8073614 0.21924588 0.3817003 4.591381
## 11 rpl-15 176891 0.60974626 0.8334502 0.07469306 0.1463320 12.282650
## 12 W01D2.1 175141 0.64650201 0.8536850 -0.14415613 0.3143210 9.719753
## 13 cpg-1 175586 0.67184603 0.8671536 -0.09196471 0.2170946 12.018032
## 14 rla-2 178297 0.67569946 0.8698780 0.51812703 1.2385346 6.641657
## 15 F30H5.4 <NA> 0.72149173 0.8934251 0.06515018 0.1827670 6.214460
## 16 F56C9.12 <NA> 0.82874076 0.9422149 -0.18572648 0.8585853 2.861803
## 17 Y47G6A.36 <NA> 0.88381951 0.9641391 0.06355846 0.4349474 4.351655
## 18 ZK897.4 <NA> NA NA NA NA NA
## 19 F01D5.12 <NA> NA NA NA NA NA
## 20 Y53F4B.47 <NA> NA NA NA NA NA
## var_obs tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 1.58019661 0.3868862708 0.622900911 0.92316793 0.92316793
## 2 0.09462516 0.0214216121 0.049660719 0.07694267 0.07694267
## 3 0.09462516 0.0214216121 0.049660719 0.07694267 0.07694267
## 4 0.06655559 0.0060546863 0.049973580 0.01903660 0.04997358
## 5 0.21723116 0.0562966440 0.131791839 0.01991235 0.13179184
## 6 0.26647941 0.1259864136 0.127697596 0.02409404 0.12769760
## 7 0.09421336 0.0383721861 0.051639188 0.01949982 0.05163919
## 8 0.05993791 0.0003430060 0.056991566 0.04112017 0.05699157
## 9 0.05651613 0.0018706787 0.055949467 0.02119982 0.05594947
## 10 0.08722494 0.0902814662 -0.004541950 0.20110883 0.20110883
## 11 0.03734597 0.0004484745 0.041262143 0.04237762 0.04237762
## 12 0.10112001 0.1777970473 -0.066750700 0.01979837 0.01979837
## 13 0.08321084 0.0005755057 0.093684635 0.03781595 0.09368463
## 14 2.70636107 1.0639896436 2.003946395 0.02858724 2.00394639
## 15 0.03039315 0.0282369105 0.005806920 0.03857061 0.03857061
## 16 0.50842750 0.3471005907 0.234566722 1.12723668 1.12723668
## 17 0.09079038 0.1132386610 -0.008663105 0.26511991 0.26511991
## 18 NA NA NA NA NA
## 19 NA NA NA NA NA
## 20 NA NA NA NA NA
#repeating the plot without removing any transcripts makes no real difference
p.tpm0<- pca(sleuth_matrix.tpm, metadata=coldata, removeVar=0)
## -- removing the lower 0% of variables based on variance
dim(p.tpm0$loadings) #should give about 600 transcripts
## [1] 61428 8
biplot(p.tpm0, colby="condition", labSize=4.0, widthConnectors=0.2, title="Kallisto TPM; all transcripts included")
## WHAT HAPPENS IF WE LOG THE VALUES FOR est_counts?
#finally, repeat the PCA but this time using log2-transformed values for the est_counts
p<- pca(log2(sleuth_matrix+0.5), metadata=coldata, removeVar=0.99)
## -- removing the lower 99% of variables based on variance
dim(p$loadings) #should give about 600 transcripts
## [1] 614 8
#this is essentially the same PCA as by doing: prcomp(t(sleuth_matrix)), scale=FALSE, center=TRUE)
screeplot(p)
#in this case the PC2 separates the samples
biplot(p, colby="condition", labSize=4.0, widthConnectors=0.2, title="Kallisto log2(est_counts)")
pairsplot(p, colby= "condition")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
plotloadings(p)
## -- variables retained:
## R03G5.1a.1, Y66D12A.15.3, Y113G7B.23c.1, C26H9A.2c.1, ZK1307.8.4, C07A12.4a.2, Y49A3A.2.2, Y66H1B.2c.3, F36A2.1a.2, F25C8.3e.1, F22D6.5b.1, F10D11.1.2, F48E8.1a.1, C35C5.1.1, Y75B8A.22.1, F53F4.14.1, F13B9.5.1, M03A8.2.2, C25A11.4l.1, C46F11.2b.1, F26H11.5.1, Y110A7A.1b.1, C34E10.5.3, F52G2.1c.1, Y79H2A.3f.1, C25A1.10a.2, Y45F10A.6g.1, F18C12.2a.1
#get the top 20 for PC1 and PC2
rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20]
## [1] "T19B10.3.2" "Y48A6B.11b.1" "Y51F10.2a.1" "F28F8.6.2" "ZK524.2h.1"
## [6] "F35C8.7c.2" "Y105E8B.8a.3" "F55D10.1.2" "C52E12.2b.1" "C23H4.1.4"
## [11] "T24A11.1b.1" "C05D9.1.1" "B0350.2c.1" "B0414.7b.1" "T11G6.2a.1"
## [16] "F54E7.3c.1" "F56A6.2a.1" "F13B9.5.2" "F33D11.9d.1" "Y110A7A.1b.1"
#or getting all the info about these transcripts
t2g[t2g$target_id %in% rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20],]
## ens_gene target_id
## 545 WBGene00000277 C23H4.1.4
## 2258 WBGene00001334 Y105E8B.8a.3
## 3179 WBGene00001833 Y110A7A.1b.1
## 3505 WBGene00002040 F56A6.2a.1
## 3912 WBGene00002239 F13B9.5.2
## 5049 WBGene00003472 B0414.7b.1
## 5054 WBGene00003476 T24A11.1b.1
## 5853 WBGene00003918 F54E7.3c.1
## 7531 WBGene00004927 C05D9.1.1
## 9411 WBGene00006446 F28F8.6.2
## 10041 WBGene00006752 ZK524.2h.1
## 10225 WBGene00006780 B0350.2c.1
## 10474 WBGene00006831 C52E12.2b.1
## 17778 WBGene00011717 T11G6.2a.1
## 17972 WBGene00011832 T19B10.3.2
## 19655 WBGene00012972 Y48A6B.11b.1
## 26594 WBGene00018006 F33D11.9d.1
## 26640 WBGene00018037 F35C8.7c.2
## 27842 WBGene00018877 F55D10.1.2
## 32055 WBGene00021763 Y51F10.2a.1
## description
## 545
## 2258 Endoplasmic reticulum oxidoreductin-1 [Source:UniProtKB/Swiss-Prot;Acc:Q7YTU4]
## 3179 HoloCentric chromosome binding Protein [Source:UniProtKB/TrEMBL;Acc:Q95Y84]
## 3505 Heavy chain, Unconventional Myosin [Source:UniProtKB/TrEMBL;Acc:H2L082]
## 3912 Kinase suppressor of Ras A [Source:UniProtKB/Swiss-Prot;Acc:G5EFD2]
## 5049 MTK1/MEKK4 homolog [Source:UniProtKB/TrEMBL;Acc:O01837]
## 5054 Myotubularin-related protein 3 [Source:UniProtKB/Swiss-Prot;Acc:Q22712]
## 5853 Partitioning defective protein 3 [Source:UniProtKB/Swiss-Prot;Acc:Q17353]
## 7531 Sorting NeXin [Source:UniProtKB/TrEMBL;Acc:Q9GYJ9]
## 9411 Ataxin-3 homolog [Source:UniProtKB/Swiss-Prot;Acc:O17850]
## 10041 Phorbol ester/diacylglycerol-binding protein unc-13 [Source:UniProtKB/Swiss-Prot;Acc:P27715]
## 10225 AO49 ankyrin [Source:UniProtKB/TrEMBL;Acc:Q7JP86]
## 10474 Kinesin-like protein unc-104 [Source:UniProtKB/Swiss-Prot;Acc:P23678]
## 17778
## 17972 Beta-galactosidase [Source:UniProtKB/TrEMBL;Acc:Q27526]
## 19655 Regulator of spindle assembly protein 2 [Source:UniProtKB/Swiss-Prot;Acc:G5EFV3]
## 26594
## 26640 Choline transporter-like protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q20026]
## 27842 Alpha-mannosidase [Source:UniProtKB/TrEMBL;Acc:Q20829]
## 32055
## ext_gene entrezgene_id
## 545 cab-1 181382
## 2258 ero-1 173325
## 3179 hcp-6 266830
## 3505 hum-7 171650
## 3912 ksr-1 181082
## 5049 mtk-1 172246
## 5054 mtm-3 175490
## 5853 par-3 175783
## 7531 snx-1 180468
## 9411 atx-3 180049
## 10041 unc-13 172497
## 10225 unc-44 177366
## 10474 unc-104 174144
## 17778 T11G6.2 177977
## 17972 bgal-1 188583
## 19655 rsa-2 176536
## 26594 hpo-3 172256
## 26640 chtl-1 180809
## 27842 aman-1 180749
## 32055 Y51F10.2 245998
rownames(p$loadings)[order(abs(p$loadings[,2]))][1:20]
## [1] "F46F2.2a.1" "F35C8.7a.2" "F18C12.2a.1" "Y79H2A.3f.1" "B0218.1.4"
## [6] "M03D4.1d.1" "T24B8.7g.1" "C14B9.6c.3" "F29G6.3a.2" "R13H4.4d.1"
## [11] "R12B2.1a.2" "T21D12.9a.1" "R09B5.11.1" "R06F6.9.1" "F32D1.5b.3"
## [16] "C13G3.3b.1" "R11E3.8a.2" "ZK1307.8.2" "C43G2.1a.2" "F32D1.1a.1"
t2g[t2g$target_id %in% rownames(p$loadings)[order(abs(p$loadings[,2]))][1:20],]
## ens_gene target_id
## 1733 WBGene00001058 R11E3.8a.2
## 1904 WBGene00001153 R06F6.9.1
## 2776 WBGene00001565 C14B9.6c.3
## 3358 WBGene00001978 R13H4.4d.1
## 3827 WBGene00002203 F46F2.2a.1
## 6833 WBGene00004378 F18C12.2a.1
## 7414 WBGene00004858 R12B2.1a.2
## 10736 WBGene00006974 M03D4.1d.1
## 11623 WBGene00007554 C13G3.3b.1
## 14169 WBGene00009259 F29G6.3a.2
## 18225 WBGene00011980 T24B8.7g.1
## 20578 WBGene00013580 Y79H2A.3f.1
## 21512 WBGene00014249 ZK1307.8.2
## 22336 WBGene00015047 B0218.1.4
## 24635 WBGene00016610 C43G2.1a.2
## 26541 WBGene00017981 F32D1.1a.1
## 26553 WBGene00017984 F32D1.5b.3
## 26637 WBGene00018037 F35C8.7a.2
## 29422 WBGene00019979 R09B5.11.1
## 30301 WBGene00020649 T21D12.9a.1
## description
## 1733 Dipeptidyl Peptidase Four (IV) family [Source:UniProtKB/TrEMBL;Acc:Q9TYX1]
## 1904 Enoyl-CoA Hydratase [Source:UniProtKB/TrEMBL;Acc:Q09603]
## 2776 Nuclear receptor corepressor 1 [Source:UniProtKB/Swiss-Prot;Acc:P34333]
## 3358 Alpha-catenin-like protein hmp-1 [Source:UniProtKB/Swiss-Prot;Acc:P90947]
## 3827 Casein kinase I isoform delta [Source:UniProtKB/Swiss-Prot;Acc:Q20471]
## 6833 Endocytosis protein RME-8; Receptor Mediated Endocytosis [Source:UniProtKB/TrEMBL;Acc:G5ED36]
## 7414 Dwarfin sma-4 [Source:UniProtKB/Swiss-Prot;Acc:P45897]
## 10736 Kinesin-like protein [Source:UniProtKB/TrEMBL;Acc:Q8MQ03]
## 11623 Serine/threonine protein phosphatase 2A regulatory subunit [Source:UniProtKB/TrEMBL;Acc:A9UJN4]
## 14169
## 18225 Ubiquitin carboxyl-terminal hydrolase [Source:UniProtKB/TrEMBL;Acc:N1NVC2]
## 20578
## 21512
## 22336 Fatty acid amide hydrolase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q17449]
## 24635 Progestin and adipoQ receptor-like protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q94177]
## 26541 Fidgetin-like protein 1 [Source:UniProtKB/Swiss-Prot;Acc:O16299]
## 26553 GMP reductase [Source:UniProtKB/Swiss-Prot;Acc:O16294]
## 26637 Choline transporter-like protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q20026]
## 29422 Facilitated glucose transporter homolog [Source:UniProtKB/Swiss-Prot;Acc:O44616]
## 30301 Leucine-rich repeats and immunoglobulin-like domains protein sma-10 [Source:UniProtKB/Swiss-Prot;Acc:Q965M2]
## ext_gene entrezgene_id
## 1733 dpf-5 177251
## 1904 ech-4 174665
## 2776 gei-8 176145
## 3358 hmp-1 179624
## 3827 kin-20 181620
## 6833 rme-8 172594
## 7414 sma-4 175815
## 10736 zen-4 177374
## 11623 pptr-2 179526
## 14169 hpo-34 181340
## 18225 T24B8.7 174440
## 20578 Y79H2A.3 176638
## 21512 ZK1307.8 174515
## 22336 faah-1 177613
## 24635 paqr-1 177416
## 26541 figl-1 178829
## 26553 F32D1.5 178834
## 26637 chtl-1 180809
## 29422 R09B5.11 178634
## 30301 sma-10 176849
#NOTE: when log2 is used with est_counts, it will be the lower counts that dominate the PCA as they will have the larger variance due to Poisson noise
quantile(sleuth_table_txn$mean_obs, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 0.7675163 3.5198016 5.7369968 7.8353703 14.8657162
quantile(sleuth_table_txn$var_obs, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 0.001561586 0.071094574 0.203265678 0.902161600 27.826633120
filter(sleuth_table_txn, target_id %in% rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20])
## ens_gene target_id
## 1 WBGene00006780 B0350.2c.1
## 2 WBGene00002239 F13B9.5.2
## 3 WBGene00002040 F56A6.2a.1
## 4 WBGene00018006 F33D11.9d.1
## 5 WBGene00001334 Y105E8B.8a.3
## 6 WBGene00001833 Y110A7A.1b.1
## 7 WBGene00011832 T19B10.3.2
## 8 WBGene00006446 F28F8.6.2
## 9 WBGene00003476 T24A11.1b.1
## 10 WBGene00003472 B0414.7b.1
## 11 WBGene00000277 C23H4.1.4
## 12 WBGene00004927 C05D9.1.1
## 13 WBGene00018877 F55D10.1.2
## 14 WBGene00018037 F35C8.7c.2
## 15 WBGene00012972 Y48A6B.11b.1
## 16 WBGene00003918 F54E7.3c.1
## 17 WBGene00006831 C52E12.2b.1
## 18 WBGene00021763 Y51F10.2a.1
## 19 WBGene00006752 ZK524.2h.1
## 20 WBGene00011717 T11G6.2a.1
## description
## 1 AO49 ankyrin [Source:UniProtKB/TrEMBL;Acc:Q7JP86]
## 2 Kinase suppressor of Ras A [Source:UniProtKB/Swiss-Prot;Acc:G5EFD2]
## 3 Heavy chain, Unconventional Myosin [Source:UniProtKB/TrEMBL;Acc:H2L082]
## 4
## 5 Endoplasmic reticulum oxidoreductin-1 [Source:UniProtKB/Swiss-Prot;Acc:Q7YTU4]
## 6 HoloCentric chromosome binding Protein [Source:UniProtKB/TrEMBL;Acc:Q95Y84]
## 7 Beta-galactosidase [Source:UniProtKB/TrEMBL;Acc:Q27526]
## 8 Ataxin-3 homolog [Source:UniProtKB/Swiss-Prot;Acc:O17850]
## 9 Myotubularin-related protein 3 [Source:UniProtKB/Swiss-Prot;Acc:Q22712]
## 10 MTK1/MEKK4 homolog [Source:UniProtKB/TrEMBL;Acc:O01837]
## 11
## 12 Sorting NeXin [Source:UniProtKB/TrEMBL;Acc:Q9GYJ9]
## 13 Alpha-mannosidase [Source:UniProtKB/TrEMBL;Acc:Q20829]
## 14 Choline transporter-like protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q20026]
## 15 Regulator of spindle assembly protein 2 [Source:UniProtKB/Swiss-Prot;Acc:G5EFV3]
## 16 Partitioning defective protein 3 [Source:UniProtKB/Swiss-Prot;Acc:Q17353]
## 17 Kinesin-like protein unc-104 [Source:UniProtKB/Swiss-Prot;Acc:P23678]
## 18
## 19 Phorbol ester/diacylglycerol-binding protein unc-13 [Source:UniProtKB/Swiss-Prot;Acc:P27715]
## 20
## ext_gene entrezgene_id pval qval b se_b mean_obs
## 1 unc-44 177366 0.001463698 0.01481936 5.03306203 1.581835 3.414347
## 2 ksr-1 181082 0.105855253 0.32695448 -3.48457088 2.154818 2.935111
## 3 hum-7 171650 0.109653317 0.33470561 3.49763201 2.186357 2.229330
## 4 hpo-3 172256 0.111081842 0.33724694 -3.44013593 2.159059 2.137390
## 5 ero-1 173325 0.111532269 0.33770854 3.53290187 2.220075 2.365086
## 6 hcp-6 266830 0.115536196 0.34571721 -4.15334872 2.639076 2.847424
## 7 bgal-1 188583 0.121276485 0.35502790 3.88599406 2.508002 2.628493
## 8 atx-3 180049 0.327906125 0.62781288 -2.18484553 2.233217 2.283616
## 9 mtm-3 175490 0.382496351 0.67778238 2.24459925 2.570232 3.969596
## 10 mtk-1 172246 0.710478150 0.88816775 -1.09186496 2.941336 3.657801
## 11 cab-1 181382 0.841087031 0.94659276 0.50514161 2.519369 4.326480
## 12 snx-1 180468 0.842984258 0.94717668 -0.59891670 3.023645 2.685658
## 13 aman-1 180749 0.876144898 0.96123011 -0.45787971 2.937799 2.721932
## 14 chtl-1 180809 0.882680362 0.96349770 0.36602444 2.480307 2.693936
## 15 rsa-2 176536 0.906500062 0.97188504 0.30321605 2.581566 2.253511
## 16 par-3 175783 0.925036035 0.97777186 -0.24556365 2.609824 2.188864
## 17 unc-104 174144 0.942914782 0.98376401 -0.18951163 2.646557 2.850742
## 18 Y51F10.2 245998 0.993419412 0.99763789 0.02787542 3.379807 3.377456
## 19 unc-13 172497 NA NA NA NA NA
## 20 T11G6.2 177977 NA NA NA NA NA
## var_obs tech_var sigma_sq smooth_sigma_sq final_sigma_sq
## 1 11.527123 3.819028 1.185379 0.7017787 1.185379
## 2 11.429052 3.776768 5.509715 1.0604238 5.509715
## 3 11.689821 4.466722 5.093593 1.8505943 5.093593
## 4 11.372503 5.797378 3.525697 1.9790858 3.525697
## 5 12.015368 5.042340 4.815124 1.6723276 4.815124
## 6 16.868179 4.818131 9.111310 1.1407393 9.111310
## 7 15.097545 3.657732 8.922421 1.3628281 8.922421
## 8 9.913455 7.513567 2.460947 1.7777063 2.460947
## 9 12.764227 6.506484 6.705706 0.4068814 6.705706
## 10 15.171686 4.130494 13.172417 0.5606838 13.172417
## 11 10.953854 4.838679 7.855761 0.2729035 7.855761
## 12 15.775221 3.256951 15.027906 1.3018105 15.027906
## 13 14.855325 6.105803 11.155525 1.2642113 11.155525
## 14 10.584430 2.992747 9.311097 1.2931537 9.311097
## 15 11.451100 5.720018 7.608952 1.8178592 7.608952
## 16 11.693536 4.938084 8.684274 1.9063537 8.684274
## 17 12.017571 5.873822 8.134705 1.1376122 8.134705
## 18 19.582673 6.020286 16.825906 0.7248982 16.825906
## 19 NA NA NA NA NA
## 20 NA NA NA NA NA
#the distribution of est_counts (not logged) for the important loadings in PC1
quantile(filter(so[["obs_norm"]], target_id %in% rownames(p$loadings)[order(abs(p$loadings[,1]))][1:20])$est_counts)
## 0% 25% 50% 75% 100%
## 0.000000e+00 8.916271e-05 1.499096e+01 5.880107e+01 2.882697e+02
#the distribution of est_counts (not_logged) across all transcripts
quantile(so[["obs_norm"]]$est_counts)
## 0% 25% 50% 75% 100%
## 0.000000 0.000000 0.000000 5.034019 36437.874329
#Some additional exploratory analysis using sleuth's own functions
plot_pca(so,
color_by = 'condition',
text_labels = TRUE)
#the two types of samples are separated but only just (the SGs cluster together though)
plot_pca(so,
color_by = 'condition',
text_labels = TRUE, units="tpm")
plot_pca(so,
pc_x=2,
pc_y=3,
color_by = 'condition',
text_labels = TRUE)
plot_pca(so,
pc_x=2,
pc_y=8,
color_by = 'condition',
text_labels = TRUE)
#how much of the variance is accounted for by each PC?
plot_pc_variance(obj = so, units = "est_counts")
plot_pc_variance(obj = so, units = "tpm")
#note PC8 also separates the samples
plot_loadings(so, pc_input=1)
plot_loadings(so, pc_input=2)
plot_loadings(so, pc_input=3)
plot_loadings(so, pc_input=8)
#note that all influential genes in PCA seem to be very highly expressed (over 1000 tpm)
plot_group_density(so,
use_filtered = FALSE,
units = "est_counts",
trans = "log",
grouping = "condition")
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `summarise_()` is deprecated as of dplyr 0.7.0.
## Please use `summarise()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_group_density(so,
use_filtered = TRUE,
units = "est_counts",
trans = "log",
grouping = "condition")
plot_sample_heatmap(so)
#check whether there are sex differences
#list of sex determining genes is from:
#https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2619291/pdf/324.pdf
list.sex_genes <- c("xol-1", "sdc-1", "sdc-2",
"her-1", "tra-2", "tra-3",
"fem-1", "fem-2", "fem-3", "tra-1")
plot_transcript_heatmap(so,
transcripts=dplyr::filter(sleuth_table_txn,
ext_gene %in% list.sex_genes)$target_id)
#repeat the plot with est_counts instead of tpm
plot_transcript_heatmap(so,
transcripts=dplyr::filter(sleuth_table_txn,
ext_gene %in% list.sex_genes)$target_id,
units="est_counts")
#ZK287.8b.1 seems to separate two of the normal samples but it's a transcript that
#is not used in differential expression due to filtering - below are original counts
#note that basic filter in sleuth filters out any transcripts with fewer than 5 estimated counts
#in 47% of the samples (effectively here half of the samples)
so$obs_raw[so$obs_norm$target_id == "ZK287.8b.1",]
## target_id est_counts eff_len len tpm sample
## 480169 ZK287.8b.1 0.000000 54.84584 196 0.00000 N1_S5
## 480170 ZK287.8b.1 3.000000 51.71815 196 11.69946 N2_S6
## 480171 ZK287.8b.1 3.825966 52.92062 196 16.27609 N3_S7
## 480172 ZK287.8b.1 0.000000 51.52067 196 0.00000 N4_S8
## 480173 ZK287.8b.1 0.000000 55.92125 196 0.00000 SG1_S1
## 480174 ZK287.8b.1 0.000000 53.04100 196 0.00000 SG2_S2
## 480175 ZK287.8b.1 0.000000 54.34783 196 0.00000 SG3_S3
## 480176 ZK287.8b.1 0.000000 53.99656 196 0.00000 SG4_S4
A note on the variability of PCA plots We have plotted a number of PCA plots in the code above. Most look similar but there are also some striking differences. What drives most of the differences is that some of the plots have used log-transformed values of counts and others haven't. The est_counts in the so object's obs_norm and obs_raw data frames are not log normalised (however when they are shown in the results table, the mean observations are indeed logged (in this case transformed by log2 as this is how we built the object)). Hence, when we are using est_counts extracted with sleuth_to_matrix, these counts are not in a log scale. Some are indeed very large and will dominate the PCA plot (they also have very large variance in RNA-seq). If we log these values, then the lower counts will dominate the PCA as they will have the larger variance due to Poisson noise.
date()
## [1] "Wed Jul 1 11:10:24 2020"
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] PCAtools_1.2.0 cowplot_1.0.0
## [3] lattice_0.20-41 reshape2_1.4.4
## [5] apeglm_1.8.0 DESeq2_1.26.0
## [7] SummarizedExperiment_1.16.1 DelayedArray_0.12.3
## [9] BiocParallel_1.20.1 matrixStats_0.56.0
## [11] pheatmap_1.0.12 EnhancedVolcano_1.4.0
## [13] ggrepel_0.8.2 Rgraphviz_2.30.0
## [15] topGO_2.38.1 SparseM_1.78
## [17] graph_1.64.0 org.Ce.eg.db_3.10.1
## [19] GO.db_3.10.0 goseq_1.38.0
## [21] geneLenDataBase_1.22.0 BiasedUrn_1.07
## [23] reactome.db_1.70.0 AnnotationDbi_1.48.0
## [25] Biobase_2.46.0 biomaRt_2.42.1
## [27] ggplot2_3.3.2 dplyr_1.0.0
## [29] tximport_1.14.2 sleuth_0.30.0
## [31] tidyr_1.1.0 rtracklayer_1.46.0
## [33] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
## [35] IRanges_2.20.2 S4Vectors_0.24.4
## [37] BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.8 Hmisc_4.4-0 BiocFileCache_1.10.2
## [4] plyr_1.8.6 lazyeval_0.2.2 splines_3.6.3
## [7] digest_0.6.25 htmltools_0.5.0 fansi_0.4.1
## [10] magrittr_1.5 checkmate_2.0.0 memoise_1.1.0
## [13] cluster_2.1.0 Biostrings_2.54.0 annotate_1.64.0
## [16] askpass_1.1 bdsmatrix_1.3-4 prettyunits_1.1.1
## [19] jpeg_0.1-8.1 colorspace_1.4-1 blob_1.2.1
## [22] rappdirs_0.3.1 xfun_0.14 crayon_1.3.4
## [25] RCurl_1.98-1.2 genefilter_1.68.0 survival_3.2-3
## [28] glue_1.4.1 gtable_0.3.0 zlibbioc_1.32.0
## [31] XVector_0.26.0 BiocSingular_1.2.2 Rhdf5lib_1.8.0
## [34] scales_1.1.1 mvtnorm_1.1-1 DBI_1.1.0
## [37] Rcpp_1.0.4 xtable_1.8-4 progress_1.2.2
## [40] emdbook_1.3.12 htmlTable_2.0.0 dqrng_0.2.1
## [43] rsvd_1.0.3 foreign_0.8-75 bit_1.1-15.2
## [46] Formula_1.2-3 htmlwidgets_1.5.1 httr_1.4.1
## [49] RColorBrewer_1.1-2 acepack_1.4.1 ellipsis_0.3.1
## [52] pkgconfig_2.0.3 XML_3.99-0.3 farver_2.0.3
## [55] nnet_7.3-14 dbplyr_1.4.4 locfit_1.5-9.4
## [58] utf8_1.1.4 tidyselect_1.1.0 labeling_0.3
## [61] rlang_0.4.6 munsell_0.5.0 tools_3.6.3
## [64] cli_2.0.2 generics_0.0.2 RSQLite_2.2.0
## [67] evaluate_0.14 stringr_1.4.0 yaml_2.2.1
## [70] knitr_1.28 bit64_0.9-7 purrr_0.3.4
## [73] nlme_3.1-148 compiler_3.6.3 rstudioapi_0.11
## [76] curl_4.3 png_0.1-7 tibble_3.0.1
## [79] geneplotter_1.64.0 stringi_1.4.6 GenomicFeatures_1.38.2
## [82] Matrix_1.2-18 vctrs_0.3.1 pillar_1.4.4
## [85] lifecycle_0.2.0 irlba_2.3.3 data.table_1.12.8
## [88] bitops_1.0-6 R6_2.4.1 latticeExtra_0.6-29
## [91] gridExtra_2.3 MASS_7.3-51.6 assertthat_0.2.1
## [94] rhdf5_2.30.1 openssl_1.4.1 withr_2.2.0
## [97] GenomicAlignments_1.22.1 Rsamtools_2.2.3 GenomeInfoDbData_1.2.2
## [100] mgcv_1.8-31 hms_0.5.3 rpart_4.1-15
## [103] coda_0.19-3 DelayedMatrixStats_1.8.0 rmarkdown_2.3
## [106] aggregation_1.0.1 bbmle_1.0.23.1 numDeriv_2016.8-1.1
## [109] base64enc_0.1-3
#save.image(file="neurexin_analysis.Rdata")